Gregor-Mendel-Institute / bookend

End-guided RNA assembler
MIT License

bookend for usage in nano pore data? #1

Closed hacaoe closed 2 years ago

hacaoe commented 2 years ago

Hey,

I was trying to use the bookend pipeline for Nanopore sequencing data and couldn't get it to work. I was able to generate an ELR file from my BAM files (bookend elr worked), but assemble only returns an empty GTF file, with no error reported. Do you know how this could happen, and would you even recommend using bookend for Nanopore data?

maschon0 commented 2 years ago

Hello @hacaoe,

By default, 'bookend assemble' will not output any incomplete transcript models (a complete model requires a start label, an end label, and no gaps). Also by default, 'bookend elr' looks for end information that is embedded in the read name by 'bookend label' upstream.

For Nanopore reads (direct RNA or cDNA), it may be safe to assume all your reads were 5'- and 3'-labeled. If you know this is true, then you can run 'bookend elr -s -e --stranded', where the three additional arguments label every read as containing a start and end tag, and as a strand-specific alignment (forward strand).

If your ELR file contains end labels, 'bookend assemble' should be able to process Nanopore reads. Does remaking the ELR file with these extra arguments resolve the problem?

hacaoe commented 2 years ago

Hey @maschon0,

I tried your suggestion, and assemble now actually seems to run (before, it just output an empty file in around 7 seconds). The issue now is that assemble takes a very long time or never completes. I ran it both in my terminal (it didn't finish after multiple hours, and then a crash occurred) and on my institute's server (where assemble has been running for roughly 4 days now). Do you know how this could happen?

maschon0 commented 2 years ago

Hi @hacaoe,

The assembly should definitely not take this long! I have a few additional questions about your dataset: 1) How many aligned Nanopore reads are in your BAM file(s)? 2) Are the BAM file(s) sorted by genomic position?

Can you share the first ~200 lines or so of the ELR file(s) that you are using as input to 'bookend assemble'? This will help me get a better idea of what is going wrong.

hacaoe commented 2 years ago

The BAM file I am using for bookend elr has 2792411 reads in it, and I sorted it by read name (iirc that was required for bookend elr to work). A txt file with the first 200 lines is attached here: w118_200.txt

maschon0 commented 2 years ago

Can you send more lines from this file? I would like to see some read alignments that are below the file header (after the lines beginning with #)
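To pull out the lines being asked for here, a small Python sketch can skip the header and return the first few alignment records. It relies only on what is stated above (header lines begin with `#`); the function name `first_records` is hypothetical and not part of Bookend.

```python
from itertools import islice
from typing import Iterable, List


def first_records(lines: Iterable[str], n: int = 200) -> List[str]:
    """Return the first n lines that do not begin with '#'."""
    # Drop every '#' line (the file header), then strip trailing newlines.
    body = (line.rstrip("\n") for line in lines if not line.startswith("#"))
    return list(islice(body, n))
```

For example, `first_records(open("sample.elr"))` would return up to 200 record lines, where `sample.elr` is a placeholder file name.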

Try running 'bookend elr-sort -o [output_filename] [input_filename]' to sort the ELR file by position, then run assembly on the sorted file. Running assembly with the verbose option (-v) will also help show you the operations it is performing and the number of reads contained in each chunk.
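The two steps above can be sketched as a small Python wrapper. Only `elr-sort -o` and `assemble -v` come from this thread; passing the ELR file to 'bookend assemble' as a positional argument is an assumption, as are the function and file names, so check the help text for 'bookend assemble' before relying on it.

```python
import shutil
import subprocess
from typing import List


def sort_then_assemble_commands(elr_in: str, elr_sorted: str) -> List[List[str]]:
    """Build two command lines: position-sort the ELR, then assemble verbosely."""
    sort_cmd = ["bookend", "elr-sort", "-o", elr_sorted, elr_in]
    # ASSUMPTION: the input ELR is given positionally to 'bookend assemble'.
    assemble_cmd = ["bookend", "assemble", "-v", elr_sorted]
    return [sort_cmd, assemble_cmd]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    if shutil.which("bookend") is None:
        raise RuntimeError("bookend was not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to print and inspect them first, e.g. `sort_then_assemble_commands("sample.elr", "sample.sorted.elr")` with placeholder file names.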

hacaoe commented 2 years ago

Oh, assemble works now! Not sorting the ELR files by position must have been the issue, then. Do you still want to see the rest of the lines? Thanks for the help, by the way :)

hacaoe commented 2 years ago

Ah, this may be a stupid question, but is it normal that the 2nd column of the GTF file just says bookend repeatedly?

maschon0 commented 2 years ago

I'm glad it's working now! Be sure to look through the different arguments for assembly (section 3.4 in the Bookend User Manual). The default settings assume short-read RNA-seq data, but you might be able to find better settings for Nanopore reads. I will push an update soon to make 'bookend elr' automatically write a sorted ELR file to avoid future confusion about this.

The second column of a GTF/GFF3 file is for the "source" of the annotation, which is conventionally the "name of the program that generated this feature, or the data source (database or project name)" (see https://ensembl.org/info/website/upload/gff3.html). Following convention, the second column of the file states that the annotation was generated by Bookend. I just noticed that the help text for 'bookend assemble' suggests that you can provide a different --source, but it's currently hardcoded.
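As a quick check of the source column, a short Python sketch can tally the values in column 2 of a GTF file. The function name and the sample lines are made up for illustration and are not real Bookend output; the tab-separated layout with `#` comment lines is the usual GTF/GFF convention.

```python
from collections import Counter
from typing import Iterable


def count_sources(lines: Iterable[str]) -> Counter:
    """Count the values in the second (source) column of GTF/GFF lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


# Made-up example lines, for illustration only.
example = [
    "chr1\tbookend\ttranscript\t100\t900\t.\t+\t.\tgene_id \"g1\"; transcript_id \"g1.1\";\n",
    "chr1\tbookend\texon\t100\t400\t.\t+\t.\tgene_id \"g1\"; transcript_id \"g1.1\";\n",
]
print(count_sources(example))
```

If every feature in a file was written by one program, the tally has a single key, which is why the column looks repetitive.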