adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License
40 stars 4 forks source link

Empty output of the .table.txt file #10

Closed sunyumail93 closed 3 years ago

sunyumail93 commented 3 years ago

Hello @adamewing,

Thank you very much for creating this wonderful tool! I am recently running it on our Nanopore sequencing data. I used the following command to run:

tldr -b Data.bam -e TE.fa -r genome.fa -p 20 -o Output -n Insert.bed6.gz --color_consensus --detail_output --extend_consensus 20000

I didn't get any error when running the script, however, the final Output.table.txt file was empty (only the header line). From the messages printed by the software, I can see: .. writing clusters to Output/chrxxx.pickle Then it starts to load clusters: loaded xxx clusters from Output/chrxxx.pickle Finally write output: finished Output/chrxxx.pickle. wrote 0 records to Output/chrxxx.pickle.

It was a little strange that all chromosomes had 0 records written. Could you please provide me some suggestions on this problem? Our data is ~35X coverage of the genome.

Thank you! Harry

adamewing commented 3 years ago

Hi Harry,

Yep, that's probably not right. A couple questions to start troubleshooting: how the reads were aligned (minimap2? parameters?), Is the Genome.fa the same reference as was used to create the .bam file?

--Adam

sunyumail93 commented 3 years ago

Hi Adam

Thank you so much for your quick response! I am happy to provide more details about the run:

Since we work with Nanopore data, I used minimap2 to align fastq sequences to the chicken genome galGal6: minimap2 -a -x map-ont galGal6.fa Data.fastq | samtools sort -T tmp -o Data.sorted.bam

Then I used NGMLR and Sniffles to call structural variations and converted the sites to bed6. I noticed that in the example file, column 2 and column 3 numbers should be same. I used tabix to index the bed6.gz file. Since I guess this step may cause the error, I tried to run tldr with and without -n input, but I got same results: There is only one header line in the .table.txt file. I confirm that the genome fasta file is same when running mapping and tldr. After running, the output directory is also empty. Those intermediate files were all gone.

I attached the output of the run: https://de.cyverse.org/dl/d/E7A295D3-021C-455F-8148-A6F09AACF801/TLDR4.1_123723052415163_err_2895733.txt

Thanks a lot for helping!

Harry

adamewing commented 3 years ago

Hi Harry,

No worries, good to see it used on different organisms as it's only been tested extensively in-house with human and mouse so far.

The -n file isn't important for getting output, it just annotates the "NonRef" column. The intended use there is if you have a list of known insertions in the population that aren't in the reference. You could also use it to intersect with SV calls from elsewhere as you're doing.

It looks like there's clusters but nothing begin called, so a possible issue is alignment between the insertions it's finding and your .fasta file. Could you pull the latest from the githib repo and re-run with --debug enabled? (fair warning, this will produce a lot if output).

--Adam

sunyumail93 commented 3 years ago

Hi Adam,

Thank you for the suggestions!

I have checked that the software is the latest version v1.1.

I ran again with the --debug parameter and please see the output. The result files were same as before, with only a header line.

https://de.cyverse.org/dl/d/368FB7D1-AEDA-430B-B1B3-ED75E42831ED/TLDR4.1_123723052415163_err_2896749.txt

Harry

sunyumail93 commented 3 years ago

I also attached the TE sequences:

https://de.cyverse.org/dl/d/E6F63289-F993-48B5-9184-65459C915E28/galGal6.repBase.20201213.TLDR.fa

adamewing commented 3 years ago

Thanks, sorry, I meant pull the most recent changes (commit 33274b8) as this should increase the verbosity from --debug and give me an indication of whether the TE library is aligning.

sunyumail93 commented 3 years ago

No problem~ I will let you know when I get the results. Thanks a lot!

sunyumail93 commented 3 years ago

Please see the rerun results using the updated software: https://de.cyverse.org/dl/d/0FD12FDA-049A-497B-AC30-FBDF525391B9/TLDR4.1_123723052415163_err_2913241.txt

I also noticed that you provided a test data, however running the test data generated an error here: https://de.cyverse.org/dl/d/6BD61C34-570C-4349-9058-D281D8016B56/tldrrun_15293959311677_err_testdata.txt

Hope these information are helpful! Thank you!

adamewing commented 3 years ago

Hmm, this looks like you're missing a dependency somewhere. Can you verify that the following tools are available in the same environment where you're running tldr: minimap2 samtools mafft exonerate

Also, does python setup.py install throw any errors?

Thanks for your patience.

sunyumail93 commented 3 years ago

Thank you for the suggestions! Actually I did realize the samtools didn't work well due to a module conflict when I ran tldr on our cluster. After fixing this issue, all four softwares can be loaded properly, and the test run generated correct results by reporting one insertion belong to L1 family. There were no error when installing the software.

While it puzzled me that none my 6 chicken nanopore data generated any results in .table.txt. I wonder whether this was because the insertion sequences don't match well with the known transposons? If the transposon was truncated during transposition, would you think tldr can detect it?

Thank you for your help!

sunyumail93 commented 3 years ago

Hello~ I would like to share with you some debugging progress on my side. Basically, I tried to combine my data with your test data and run, but still no output generated. So I doubt that my own files caused the error.

Then I tried to narrow down the files, and it seems that my TE file caused the problem. I identified at least one TE that caused the tldr blank output. If possible, could you please take a look whether this also happens on your side?

I added the following TE sequence to your teref.human.fa file to make the teref.human.adding.fa file

TransposableElement:Eulor9A caaaggaaagtaaaatgcaaaaaatcctacgttaatgcaacgttacggttgagattttaaacgcacaaaagtcaggaaattcaaagttacggttcccacagcaaccgtaactcggccacattgcacatactgatattaaggcataaatttaacaxcatatacagtaatacaaaatacttctatgcgcaagggggccgagttacggttgccgtgggaaccgtaactttgaatttcctgacttttgtgcgtttaaattctcaaccataatgttgcattaacgtagttttttgcatttaatxta

Then I ran the command: tldr -b test.bam -e teref.human.adding.fa -r ref.fa -o testrun --color_consensus --detail_output --extend_consensus 20000 --debug

Then the output became nothing. You may have noticed that this sequence contains two 'x' nucleotides. I downloaded the sequence from here (https://www.girinst.org/protected/repbase_extract.php?access=Eulor9A) so I don't know why two 'x's were in the sequence, but I think I figured out the problem finally. After changing them to 'n', they software worked well.

Again, thank you for creating this wonderful tool, and thank you for all the suggestions!

adamewing commented 3 years ago

Ah, interesting... I've made a change to check and correct for this in the future in 6f1bdfc. Many thanks!