GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
189 stars 24 forks source link

how to find relavent gene of incompatible aligned reads? #360

Closed icanccwhite closed 1 year ago

andredsim commented 1 year ago

Hi,

The gene ids for incompatible read counts can be found in the metadata of the output summarized experiment object. metadata(se)$incompatibleCounts

If you want to track exactly which reads are aligned as incompatible it is a bit more convoluted. First you need to run Bambu with trackReads = TRUE, discovery = FALSE, quant = FALSE rcFile <- bambu(reads = sample, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE, returnDistTable = TRUE, trackReads = TRUE) Then you need to separately finish discovery and quantification with returnDistTable = TRUE(note that you can use the output from the first command for the reads argument) se <- bambu(reads = rcFile, annotations = annotations, genome = fa.file, returnDistTable = TRUE Then you can look for the incompatible group of reads in the distTable of the output which will have the annotationTxId of [GeneID]unidentified For example if I wanted to look for the incompatible aligned reads for ENSG00000181404 in our test dataset that comes with Bambu. (Note that this is for a single sample run. If you have multiple samples, you need to replace [[1]] with the sample name or appropriate index.)_ readClassIds = metadata(se)$distTable[[1]]$readClassId[which(metadata(se)$distTable[[1]]$annotationTxId=="ENSG00000181404_unidentified")] This will return a vector of read class ids.e.g [1] "rc.5" "rc.2" "rc.4" "rc.7" "rc.8" "rc.9" "rc.10" You then need to find the read indexes associated with these read classes. readIndexes = unlist(rowData(rcFile[[1]])[readClassIds,]$readIds) [1] 6 1 5 4 8 9 10 14 Then you can find the real read ids using this metadata(rcFile[[1]])$readNames[readIndexes] [1] "4fe6307f-f77e-45fa-90ee-8f6b0c8677b9" "a9ec59c8-e070-4c13-87f1-d6f12023c31d" [3] "e00a4dae-fc4b-46f6-b330-13363df4d677" "4f38e2ab-b7bb-4a78-aaab-4e9e8efa50f6" [5] "6cbac604-33d3-4123-90de-13ad8fa7a9b7" "4eec153a-9c9b-4c90-bbbd-ef3be212aebb" [7] "7c5e6416-99cb-435f-8e7f-d9e18bc74579" "9a2443fb-ed14-4001-bdc3-de5e35f18267"

If you have a read id and you want to find which group it is associated to, you need to do the above steps but in reverse.

Hope this helps and feel free to expand on your query if I did not cover it.

Kind Regards, Andre Sim

icanccwhite commented 1 year ago

hi, another question, are there reads-assigned ids that could be searched after the EM algorithm(demultiplex of the multi-aligned transcripts)?

cying111 commented 1 year ago

Hi @icanccwhite , am I correct that you are asking for the proportionate assignment from read to transcript after EM?

icanccwhite commented 1 year ago

Hi @icanccwhite , am I correct that you are asking for the proportionate assignment from read to the transcript after EM?

hi @cying111, The reads id of the assigned transcript after EM. Is it possible? I should admire that the bambu quantification is better for the BC level matrix, I have tested. thx for your help!

cying111 commented 1 year ago

hi @icanccwhite , thanks for the confirmation. Unfortunately, we only report read to transcript assignment before EM (https://github.com/GoekeLab/bambu#Tracking-read-to-transcript-assignment). We will consider to report this for future release.

icanccwhite commented 1 year ago

Hi,

The genforids for incompatible read counts can be found in the metadata of the output summarized experiment object. metadata(se)$incompatibleCounts

If you want to track exactly which reads are aligned as incompatible it is a bit more convoluted. First you need to run Bambu with trackReads = TRUE, discovery = FALSE, quant = FALSE rcFile <- bambu(reads = sample, annotations = annotations, genome = fa.file, discovery = FALSE, quant = FALSE, returnDistTable = TRUE, trackReads = TRUE) Then you need to separately finish discovery and quantification with returnDistTable = TRUE(note that you can use the output from the first command for the reads argument) se <- bambu(reads = rcFile, annotations = annotations, genome = fa.file, returnDistTable = TRUE Then you can look for the incompatible group of reads in the distTable of the output which will have the annotationTxId of [GeneID]unidentified For example if I wanted to look for the incompatible aligned reads for ENSG00000181404 in our test dataset that comes with Bambu. (Note that this is for a single sample run. If you have multiple samples, you need to replace [[1]] with the sample name or appropriate index.)_ readClassIds = metadata(se)$distTable[[1]]$readClassId[which(metadata(se)$distTable[[1]]$annotationTxId=="ENSG00000181404_unidentified")] This will return a vector of read class ids.e.g [1] "rc.5" "rc.2" "rc.4" "rc.7" "rc.8" "rc.9" "rc.10" You then need to find the read indexes associated with these read classes. readIndexes = unlist(rowData(rcFile[[1]])[readClassIds,]$readIds) [1] 6 1 5 4 8 9 10 14 Then you can find the real read ids using this metadata(rcFile[[1]])$readNames[readIndexes] [1] "4fe6307f-f77e-45fa-90ee-8f6b0c8677b9" "a9ec59c8-e070-4c13-87f1-d6f12023c31d" [3] "e00a4dae-fc4b-46f6-b330-13363df4d677" "4f38e2ab-b7bb-4a78-aaab-4e9e8efa50f6" [5] "6cbac604-33d3-4123-90de-13ad8fa7a9b7" "4eec153a-9c9b-4c90-bbbd-ef3be212aebb" [7] "7c5e6416-99cb-435f-8e7f-d9e18bc74579" "9a2443fb-ed14-4001-bdc3-de5e35f18267"

If you have a read id and you want to find which group it is associated to, you need to do the above steps but in reverse.

Hope this helps and feel free to expand on your query if I did not cover it.

Kind Regards, Andre Sim

Hi Andre, Many thanks for the code details. It works well, but why there are duplicate reads in the gene incompatible counts part?

Kr, Xiaoyu

andredsim commented 1 year ago

Hi Xiaoyu,

Very sorry I didn't see you question and its been a very long time since you asked. I hope you found an answer in the mean time. Unfortunately I do not have an answer as I am not sure what you mean by duplicate reads? Perhaps you could include an example as all the reads in the code I provided are unique?

Given the time I will move this over to the discussion as I think this will be useful for other users but if your question is still open feel free to reopen the issue.

Kind Regards, Andre Sim