TobiTekath / DTUrtle

Perform differential transcript usage (DTU) analysis of bulk or single-cell RNA-seq data. See documentation at:
https://tobitekath.github.io/DTUrtle
GNU General Public License v3.0

Allow import_counts to handle missing transcripts #12

Closed kylepalos closed 1 year ago

kylepalos commented 1 year ago

Hello, and thank you for creating DTUrtle; I have found it very useful and approachable. However, I was wondering whether the following issue, which I commonly run into, can be resolved. In brief: tximport can natively handle transcripts missing from the tx2gene file when importing Salmon files, e.g.:

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 
transcripts missing from tx2gene: 262
summarizing abundance
summarizing counts
summarizing length

However, import_counts cannot complete and throws an error for this use case:

Reading in 12 salmon runs.
Using 'dtuScaledTPM' for 'countsFromAbundance'.
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 
Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion,  : 
  all(txId %in% tx2gene$tx) is not TRUE

Could this be resolved so that import_counts completes given incomplete tx2gene inputs, or added as an additional argument to import_counts?

Please let me know if you need any additional information or have any questions.

Thanks a lot, Kyle

Is your feature request related to a problem? Please describe. The problem is with the "import_counts" function throwing an error:

Error in medianLengthOverIsoform(length4CFA, tx2gene, ignoreTxVersion,  : 
  all(txId %in% tx2gene$tx) is not TRUE

When tx2gene does not contain all of the transcripts contained in the Salmon quantification files.

Describe the solution you'd like An argument added to "import_counts" that would downgrade this error to a warning.

Describe alternatives you've considered I could re-run Salmon after removing the missing transcripts from the index, or remove the missing transcripts from each Salmon output file.

TobiTekath commented 1 year ago

Hi @kylepalos,

thank you very much for reaching out, and for the kind words about DTUrtle.

Sadly, the described error is not thrown by DTUrtle itself, but by tximport when the "dtuScaledTPM" countsFromAbundance parameter is used.

You can verify this yourself by running the following, replacing the dummy names with your data:

tximport::tximport(files = "test_quant.sf",
                   type = "salmon",
                   countsFromAbundance = "dtuScaledTPM",
                   tx2gene = test_tx2gene[, c("transcript_id", "gene_name")],
                   txOut = TRUE)

The feature request is therefore outside the scope of DTUrtle.

Nonetheless, some ways to move forward:

  1. You could switch to another count normalization procedure that does not require tx2gene information (for example, "scaled" for scaledTPM).

  2. You could open a similar feature request in the tximport repo: https://github.com/mikelove/tximport

  3. As you mentioned, you could modify either the quantification files or your tx2gene table. The latter might be the easier choice in my opinion: add the "missing" transcripts to your tx2gene table, read in the data, then remove the unwanted transcripts again.

  4. From my experience, it is hard to think of a use case that would justify using different GTF annotation files for the quantification and for the data import via tximport. In a standard workflow the annotation should not differ between these two steps, so this problem should not occur. If you have no specific reason why this special case is useful for you, you might consider slightly adapting your approach.
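Option 3 above (padding the tx2gene table) can be sketched in base R. This is a minimal, hypothetical example: `quant_tx` stands in for the transcript IDs found in your quant.sf files, and the placeholder gene names are an assumption, not part of DTUrtle's API.

```r
# Hypothetical sketch: pad an incomplete tx2gene table so that tximport's
# "dtuScaledTPM" scaling does not fail, then drop the padded rows afterwards.
quant_tx <- c("tx1.1", "tx2.1", "tx3.1")  # transcript IDs from the Salmon output
tx2gene  <- data.frame(transcript_id = c("tx1.1", "tx2.1"),
                       gene_name     = c("geneA", "geneB"))

# 1. Find transcripts quantified by Salmon but absent from tx2gene.
missing_tx <- setdiff(quant_tx, tx2gene$transcript_id)

# 2. Add them with one placeholder "gene" per transcript, so the median-length
#    calculation over isoforms of real genes is not distorted.
tx2gene_padded <- rbind(
  tx2gene,
  data.frame(transcript_id = missing_tx,
             gene_name     = paste0("placeholder_", missing_tx))
)

# 3. Read in the data with the padded table, e.g.
#    cts <- import_counts(files, type = "salmon", tx2gene = tx2gene_padded)
# 4. Then remove the unwanted transcripts again, e.g.
#    cts <- cts[!rownames(cts) %in% missing_tx, ]
```

After step 4 the count matrix contains only the transcripts present in the original tx2gene table, as in the standard workflow.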

Best, Tobias

kylepalos commented 1 year ago

Hi Tobias,

Thank you for your quick response. It makes sense that the issue is with tximport; I had never used "dtuScaledTPM" before, so I naively assumed tximport was completing as usual.

Thank you also for your recommendations on moving forward; I was able to proceed with the workflow.

Best, Kyle