Closed tom-brekke closed 2 years ago
Looks like the issue is gzipped reads. Unzip them and samba.sh
appears to work. It does seem that the docs allow for gzipped files so this seems like a bug.
Thanks @aakashsur - that's a good thought, but I while I had originally tried fastq.gz reads I have since unzipped them and get the above error with both fastq and fasta reads.
Any other ideas?
-Tom
I ran into this error when running the whole masurca pipeline with the flye option included. There were two things that I needed change. 1) Unzip the fastq file. 2) Provide an absolute path instead of a relative path.
The what(): First character must be a '>', got '?'
message suggests it's hitting the ?
which is often the first character of gzipped files. Otherwise I'm at a loss of what else it could be. I would make sure you have deleted the old run too as there are some files it looks for and won't bother remaking if they exist but the old files are usually wrong.
Hmm, I've tried all that and now re-checked it just all in case I missed something: I'm definitely using unzipped reads, I've given it the full path and I'm starting in a fresh directory and I still get the same error. I agree that the error suggests something is gzipped, but I can't find any gzipped files at all.
I've been picking through some of the output files and may have found a clue: the program errors out on a nucmer call where it's trying to align patches.raw.fa and $REFN.short.fa.
$REFN.short.fa seems fine: it's a normally-formatted and unzipped fasta of my genome, which presumably has been broken up at strings of Ns. No apparent problems there.
BUT patches.raw.fa is empty. I'm not sure why an empty file would present as a '?' but it seems like something has gone wrong if samba is trying to align an empty file.
So I've done some digging through samba.sh and as far as I can tell, patches.raw.fa is first mentioned in that nucmer call (line 226) - which may explain why it's empty.
Is there a missing line or call in the samba.sh script that comes with MaSuRCA 4.0.7 perhaps? Or does anyone know where patches.raw.fa is created?
Thanks in advance! -Tom
Ah okay wow this was a hard bug to find, I think this project could definitely benefit from some unit tests or integration tests.
I think it goes something like this -
1) Nucmer throws an error
terminate called after throwing an instance of 'std::runtime_error'
what(): First character must be a '>', got '?'
samba.sh: line 249: 62316 Aborted (core dumped) $MYPATH/nucmer -l 15 -b 400 --batch 10000000 -t $NUM_THREADS patches.raw.fa $REFN.short.fa
This turns out to be an obscure error that gets thrown if one of the fasta files is empty.
2) The empty fasta file is indeed patches.raw.fa. This should get created by extract_merges.pl
, but it outputs an empty file.
3) extract_merges.pl
outputs information to patches.raw.fa on line 200
if($#names==0){#no polishing seq -- put into raw
print RAW ">$name\n$qseq{$name}\n";
But it seems there might be an erroneous comment in the if statement. I don't work with perl so I can't say for sure, but at the very least removing it seems to fix the issue.
It does seem dubious to have to change something so far down the rabbit hole but let me know what you think.
Additionally it looks like after the fix samba still requires unzipped files, and although it will run without error with a fastq file, it appears to not run correctly, suggesting it also requires a fasta file.
Wow, thanks, that is indeed deep down the rabbit hole. One question before I test it out: The if-statement in extract_merges.pl is actually an if-else statement. So I'm not sure what you mean by 'removing it' - do you mean make it something like "if(1)" and just scrap everything in the else block?
Ah just changing if($#names==0)
to if($names==0)
. I assume it was a mistype or something.
On further exploration it looks like the $#
syntax is a perl operation to get the last value of the array. Equivalent to array[-1]
in python. I'm not sure anymore if the issue is with the scripts or if there simply wasn't a way to handle/skip the nucmer step if the patches.raw.fa
file was empty.
@alekseyzimin, any thoughts?
Okay I think instead of messing around with the perl scripts, it's probably just prudent to skip that section in samba.
if [ $ALN_DATA = "pbclr" ] || [ $ALN_DATA = "ont" ];then
$MYPATH/ufasta extract -f <($MYPATH/ufasta sizes -H $REF |awk '{if($2<250000) print $1}') $REF > $REFN.short.fa.tmp && mv $REFN.short.fa.tmp $REFN.short.fa && \
$MYPATH/nucmer -l 15 -b 400 --batch 10000000 -t $NUM_THREADS patches.raw.fa $REFN.short.fa && \
$MYPATH/delta-filter -r -l 200 out.delta | \
$MYPATH/show-coords -lcHr /dev/stdin | awk '{if($16>5 || $7>500) print $0}' | \
$MYPATH/reconcile_consensus.pl patches.raw.fa $REFN.short.fa > patches.cpolished.fa.tmp && \
mv patches.cpolished.fa.tmp patches.cpolished.fa && \
cat patches.cpolished.fa patches.polished.fa > $REFN.$QRYN.patches.polish.fa.tmp && \
mv $REFN.$QRYN.patches.polish.fa.tmp $REFN.$QRYN.patches.polish.fa
else
cat patches.raw.fa patches.polished.fa > $REFN.$QRYN.patches.polish.fa.tmp && \
mv $REFN.$QRYN.patches.polish.fa.tmp $REFN.$QRYN.patches.polish.fa
fi
My guess for the nucmer related section is that it is rescuing areas the polishing couldn't handle. I'm not super clear on what it is doing but everything ends up being concatenated to $REFN.$QRYN.patches.polish.fa
whether or not you go into the if statement.
I changed the line
if [ $ALN_DATA = "pbclr" ] || [ $ALN_DATA = "ont" ];then
to
if ([ $ALN_DATA = "pbclr" ] || [ $ALN_DATA = "ont" ]) && ([ -s patches.raw.fa ]);then
to check if the patches.raw.fa
file is a non-empty file. Samba seems to work from this change.
I suppose there probably needs to just be a new release to address the issue.
I've been playing around with it a bit with it and noticed that patches.polished.fa is also empty. So even skipping the nucmer bits for patches.raw.fa likely won't be enough to sort this out.
I'm restating samba.sh now from a clean run and once it fails I'll go through and rerun the scaffold links section (starting at line 210ish) one command at a time. Hopefully that will make it clear what's going wrong. I'll keep you posted. -Tom
Thank you for the bug report. You are doing everything right, and this is indeed a bug. I fixed it in 4.0.8 release that is being tested now and will be available later today. Thank you for suggesting the change: if ([ $ALN_DATA = "pbclr" ] || [ $ALN_DATA = "ont" ]) && ([ -s patches.raw.fa ]);then
New release 4.0.8 published. Please let me know here if it works for you.
Hello @alekseyzimin - samba.sh v4.0.8 finished successfully. Thank you! Thanks also for all the troubleshooting @aakashsur -Tom
Thank you Tom for reporting and @aakashsur https://github.com/aakashsur for suggesting a fix. There was also an additional possible weakness that I fixed as well.
On Sat, Feb 12, 2022 at 12:52 PM Tom Brekke @.***> wrote:
Hello @alekseyzimin https://github.com/alekseyzimin - samba.sh v4.0.8 finished successfully. Thank you! Thanks also for all the troubleshooting @aakashsur https://github.com/aakashsur -Tom
— Reply to this email directly, view it on GitHub https://github.com/alekseyzimin/masurca/issues/274#issuecomment-1037349104, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGPXGHM6NX2F5Q5C5DESQD3U22M7PANCNFSM5M544IUQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
You are receiving this because you were mentioned.Message ID: @.***>
-- Dr. Alexey V. Zimin Associate Research Scientist Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA (301)-437-6260 website http://ccb.jhu.edu/people/alekseyz/ blog http://masurca.blogspot.com
Hello,
I am trying to use samba to improve scaffolding of a mammalian genome with some ONT reads. Here is my call:
samba.sh -r $GENOME -q $READS -d ont -t 10 -a merges.file.txt > samba.out.txt
where $GENOME and $READS contain the location of my genome and fasta-formatted reads respectively (I get the same error if I use fastq-formatted reads).
The error is this:
terminate called after throwing an instance of 'std::runtime_error' what(): First character must be a '>', got '?' /apps/genomics/masurca/4.0.7/el7/AVX512/gnu-7.3/bin/samba.sh: line 238: 84924 Aborted $MYPATH/nucmer -l 15 -b 400 --batch 10000000 -t $NUM_THREADS patches.raw.fa $REFN.short.fa [Tue 25 Jan 17:46:31 GMT 2022] links consensus failed
If you have any suggestions for me, I would really appreciate it. Thank you! -Tom