Generade-nl / TULIP

TULIP - The Uncorrected Long read Itegration Pipeline
27 stars 4 forks source link

Warning/Errors in tulipbulb.perl #1

Open ctxchris opened 7 years ago

ctxchris commented 7 years ago

Hi,

I did the following:

bwa mem -x pacbio pacbioRaw.fasta illumina.fasta |cut -f 1,2,3,4,5,6 >pacbioRaw_TO_Illumina.sam
tulipseed.perl --seedlength 385 --diploid --sam pacbioRaw_TO_Illumina.sam --out illumina.TULIP
tulipbulb.perl --seeds  illumina.fasta --reads pacbioRaw.fasta --sam pacbioRaw_TO_Illumina.sam --input illumina.TULIP --diploid --allreads  --out illumina.TULIP.bundle

tulipseed.perl seems to have run successfully. The last lines say:

Indexing scaffolds 79763 scaffolds found Scaffold N50 = 6448 Scaffold N50 seeds = 3 Total scaffold length = 242165663 nt containing 248322 seeds Writing scaffold information to illumina.TULIP.scaffolds and illumina.TULIP.scaffolds_stats Writing temporary scaffold files to illumina.TULIP.scaffolds_tmp and illumina.TULIP.seeds_tmp Writing graph to illumina.TULIP.graph and illumina.TULIP.graph_tmp Run ended

Running tulipbulb.perl produced the following warnings/errors:

--haploid and --diploid are not compatible

Even though only --diploid is passed

Several thousand time I see these:

Use of uninitialized value in concatenation (.) or string at tulipbulb.perl line 543, line 52583. Use of uninitialized value $sequence in substr at tulipbulb.perl line 568, line 52583. substr outside of string at /ctx/software/TULIP/tulipbulb.perl line 568, line 52583. Use of uninitialized value $seq in string eq at /ctx/software/TULIP/tulipbulb.perl line 688, line 52583. Use of uninitialized value $seq in transliteration (tr///) at tulipbulb.perl line 689, line 52583. Use of uninitialized value $seq in reverse at tulipbulb.perl line 690, line 52583.

The corresponding lines in the seeds file illumina.fasta seem always to be header lines:

awk 'NR==52583{print}' illumina.fasta
>scaffold16792

In the pacbio data they correspond to sequences:

awk 'NR==52583{print}'  pacbioRaw.fasta
CTTCCTTCTGTCCGTCATCTTTCAAAGCCCATATTATTTATTTTTTATGGTGGAATAGATAAAGATATTC

In my working folder there are several thousand files named like this:

illumina.TULIP.bundle_readbundle_46242.fasta

And they all only contain the pipe symbol:

cat illumina.TULIP.bundle_readbundle_48915.fasta
>

>

>

Do you have an idea what could have gone wrong?

Thank you Chris

Generade-nl commented 7 years ago

Hi Chris,

Thanks for reporting this!

The --diploid issue is a silly bug, the option is currently not even used in tulipbulb.perl, apologies for that.

The other issue may have to do with the script trying to read the SAM file and the PacBio reads simultaneously, i.e. trying to find the read matching the alignment. For this, both files need to be more or less in the same order (by read), which will happen when using BWA. The 'more or less' allows for some local shuffling due to alignment multithreading (BWA not reporting alignments in exactly the order they appear in the reads FASTA). By default, we read 5000 lines ahead to catch this (line 22, $fbuffermax). I can reproduce your error by setting $fbuffermax really low (5) and using 4 alignment threads. This needs a better fix!

If you want to try playing with this, I recommend using the --nobundles setting, tulipbulb then won't open thousands of files but just one for the scaffold sequences.

Alternatively, perhaps tulipbulb does not recognize your FASTA headers at all, and proceeds right to the end of the reads file trying to find a header? Could you perhaps share what your PacBio FASTA headers look like, and how they end up in the SAM file?

Also, not sure which genome or how much data you used, but those tulipseed stats worry me: very many scaffolds, and each consists of just a few seeds. We got thousands of seeds per scaffold.

Chris

dvera commented 7 years ago

Are you reversing your index and query in bwa mem? According to docs, you align your pacbio reads to your illumina-based seeds, and not the other way around.