Data loading issue - Githubissues

GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data

GNU General Public License v3.0

189 stars 24 forks source link

Data loading issue #312

Closed seongwoohan closed 2 years ago

seongwoohan commented 2 years ago

Hello, I am running BAMBU with my datasets. It looks like the error is saying the reference annotation isn't correct, but I used these three files in other long read packages and worked well. Is there anything I missed in the command line? I tried moving the files into the directory using system.file as the tutorial, but it fails with the same reasons. I converted fastq file to sam file using minimap2, sam file into bam file using samtools. I am using cDNA!

I used these command lines to convert fastq file to bam file.

./minimap2 -t 8 -ax splice /home/seong/R/x86_64-pc-linux-gnu-library/4.1/bambu/extdata/hg38.fa /data/long_read/ENCBS944CBA/ENCFF263YFG.fastq -o /data/long_read/ENCBS944CBA/ENCFF263YFG.sam

samtools view -@ 8 -Sb -o /data/long_read/ENCBS944CBA/ENCFF563QZR.bam /data/long_read/ENCBS944CBA/ENCFF563QZR.sam

andredsim commented 2 years ago

Hi there,

I see you have closed this issue. Can I assume you were able to resolve the issue?

seongwoohan commented 2 years ago

Hello, the main issue was due to minimap2 mapping. cDNA and direct RNA have different command line. Changing the command line solves the issue. Upon this one, I have one question regarding the final output, gtf file. I can see it's an "extended_annotations.gtf" that contains extended transcript & gene annotations for the genome using long reads data. Do you also have something like "transcript_models.gtf" with constructed transcript models (both known and novel transcripts) for output? If I have extended version, the gtf files contains more than enough information, including the genome annotation. It would be nice to have just constructed transcripts.

andredsim commented 2 years ago

Hi,

Just to make sure I understand your question, are you asking to have just the gtf output for transcripts that have read support in the sample, and any reference annotations that arn't expressed shouldn't be in the output gtf?

If so you can filter out the annotations that have no full-length read support.

constructedAnnotations = se[assays(se)$fullLengthCounts > 0]

If you only want the annotations that are novel and not found in the reference annotations:

novelAnnotations = se[mcols(se)$newTxClass != "annotation"]

And then output it as you normally do. Let me know if you have any problems with that.

andredsim commented 2 years ago

Oh one other thing I might recommend is using this branch https://github.com/GoekeLab/bambu/tree/BambuManuscriptRevision. It will be the next update and the documentation is more complete.