LabTranslationalArchitectomics / riboWaltz

optimization of ribosome P-site positioning in ribosome profiling data
MIT License
46 stars 12 forks source link

Error: $ operator is invalid for atomic vectors during bamtolist #37

Closed quarkaffi closed 3 years ago

quarkaffi commented 3 years ago

Hi,

I experienced an error during the bamtolist function and I don't understand what is causing it. I saw in other issues before that it can be caused by different annotations in the bam alignment and in the annotation file created with the create_annotation. I am using the ref genome and gtf from Ensemble for C. elegans (WBcel235).

This is how my output of create_annotation looks like transcript l_tr l_utr5 l_cds l_utr3 1: 2L52.1a.1 1357 19 1284 54 2: 2L52.1b.1 789 126 663 0 3: 2L52.2 151 0 0 0 4: 2RSSE.1a.1 1301 269 1032 0 5: 2RSSE.1b.1 1706 269 1437 0 6: 2RSSE.1c.1 1046 269 777 0

and this is my STAR alignment output with --quantMode TranscriptomeSAM

NB500982:360:HW5NVBGXH:1:11110:2371:2070_CTTTGGGA 256 F56F3.5.1 662 3 42M 0 0 TAAGGTCAAGATCATCAAGAGACAAAAGGTTGATCTTGGACG AAAEE/EEE6EEEEEAEEEEEE//EEEE/AE/EE/E/E/EE/ NH:i:2 HI:i:2 NB500982:360:HW5NVBGXH:1:11110:6828:2079_AAACATGA 0 C23H4.6b.1 3095 3 20M 0 0 ACGTTTTCATGGATATGATG AAAEEEEEEEEEEEEEEEEENH:i:2 HI:i:1 NB500982:360:HW5NVBGXH:1:11110:6828:2079_AAACATGA 256 C23H4.6a.1 3245 3 20M 0 0 ACGTTTTCATGGATATGATG AAAEEEEEEEEEEEEEEEEENH:i:2 HI:i:2 NB500982:360:HW5NVBGXH:1:11110:18755:2079_CTTCGCTA 0 F10B5.1.1 619 3 28M 0 0 CTCACCAAGTGGATTGAGAACCCAATCT AAAEEEEEEEEEEEEEEEEEEEEEEEEE NH:i:2 HI:i:1

The command i use reads_list <- bamtolist(transcript_align = "FALSE", bamfolder = "bamfiles", annotation = "annotation.txt")

The error I get Good! Number of indel below indel_threshold for all reads. No reads removed. Error: $ operator is invalid for atomic vectors

Thanks in advance

fabiolauria commented 3 years ago

Hi, you are right, usually issues about bamtolist are due to annotation mismatches. To be sure everything is ok, try to run on one of your BAM files:

sample_dt <- as.data.table(GenomicAlignments::readGAlignments("bam_file.bam"))

and then

length(unique(as.character(sample_dt$transcript)))
length(annotation_dt$transcript)
length(intersect(unique(as.character(sample_dt$transcript)), annotation_dt$transcript))

where annotation_dt is the output of _createannotation. You should be able to check if and how many transcript names in annotation match the transcript names in the BAM file.

Tip: when running a command like reads_list<- bamtolist(transcript_align = "FALSE", bamfolder = "bamfiles", annotation = "annotation.txt") logical strings such as FALSE don't need to be put within quotation marks. And be sure bamfiles is the path to the folder where BAM files are stored.

Let me know what is going on.

Best Fabio

quarkaffi commented 3 years ago

Hi Fabio,

thanks for your quick reply.

So i loaded the Bam into sample_dt, looking like: seqnames strand cigar qwidth start end width njunc 1: Y48G9A.3.1 + 28M 28 2351 2378 28 0 2: W06A7.3g.3 + 25M 25 1485 1509 25 0 3: W06A7.3g.1 + 25M 25 1483 1507 25 0 4: W06A7.3a.1 + 25M 25 791 815 25 0 5: W06A7.3f.1 + 25M 25 10011 10035 25 0 6: W06A7.3c.1 + 25M 25 9999 10023 25 0

length(unique(as.character(sample_dt$transcript))) prompts [1] 0

as the transcripts are called "seqnames" instead of "transcripts" (why is that?)

changing it to seqnames:

length(unique(as.character(sample_dt$seqnames))) [1] 33173

The annotation file with length(annotation_dt$transcript)

length(annotation$transcript) [1] 61451

equally:

length(intersect(unique(as.character(sample_dt$transcript)), annotation$transcript)) [1] 0

length(intersect(unique(as.character(sample_dt$seqnames)), annotation$transcript)) [1] 33173

Best, Martin

fabiolauria commented 3 years ago

as the transcripts are called "seqnames" instead of "transcripts" (why is that?)

Sorry for the oversight. The names of the columns are assigned by the function GenomicAlignments::readGAlignments and they decided to use "standard" names, independent from the experimental design and analyses of interest.

BTW since all seqnames are found in the annotation table, it's not an issue of matching names. Additional question to try to understand what's the problem:

quarkaffi commented 3 years ago

Hi Fabio,

bamfiles if the folder where my (4) bamfiles are located, which a folder in the working directory.

annotation.txt is the output saved from create_annotation, that i load in with as.data.table before feeding it into the script looking like

head(annotation.txt) transcript l_tr l_utr5 l_cds l_utr3 1: 2L52.1a.1 1357 19 1284 54 2: 2L52.1b.1 789 126 663 0 3: 2L52.2 151 0 0 0 4: 2RSSE.1a.1 1301 269 1032 0 5: 2RSSE.1b.1 1706 269 1437 0 6: 2RSSE.1c.1 1046 269 777 0

Best, Martin

quarkaffi commented 3 years ago

It worked, you were right. Reading the file in from a txt caused the issues.

Thank you very much.

fabiolauria commented 3 years ago

Glad to read everything is fine.

I'm going to close this issue, but feel free to contact me again at any moment.

Best Fabio