Calculate usage of IPA sites

cloudyrainy-zh commented 6 months ago

In the third step, InPACT_quantify is running with the following error, and I would like to know which file is at fault

Traceback (most recent call last): File "/home/data/miniconda3/envs/InPACT/bin/InPACT_quantify", line 103, in ipa_usage_dic = calculate_usage(transcript_tpm_path, File "/home/data/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in calculate_usage gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) File "/home/data/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) KeyError: 'NM_000350.3'

Thank you very much.

liuxc1995 commented 6 months ago

Hello! Thank you for your question.

For successful execution of the InPACT_quantify step, several key files are indispensable:

predict.result.txt: This file is generated through the execution of InPACT command.
merged.gtf: This file is derived from the InPACT_transcript command.
quant.sf: Produced by Salmon, this file is essential for downstream analyses. It is important to note that the index for the transcriptome used in Salmon should be constructed based on the merged.gtf file.

To ensure the smooth execution of the InPACT_quantify command, it is recommended to meticulously verify the integrity and compatibility of these input files.

Thank you, Liu

Badman1025 commented 2 months ago

Hello, I encountered the same issue. The first step outputted a file named predict.result.txt, and the second step generated a merged.gtf. I believe these two steps should be correct, but I also encountered an error in the third step:

Traceback (most recent call last): File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 103, in ipa_usage_dic = calculate_usage(transcript_tpm_path, File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in calculate_usage gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) KeyError: 'NM_001080438.1' Even though the quant.sf file may output different names due to the fa file: hg38.fa： Name Length EffectiveLength TPM NumReads chr1 248956422 248956133.387 66.538852 2386248.779 chr10 133797422 133797133.387 43.224727 833099.075 chr11 135086622 135086333.387 73.907179 1438187.886 chr11_KI270721v1_random 100316 100027.387 0.000000 0.000 chr12 133275309 133275020.387 59.799460 1148057.411 Homo_sapiens.GRCh38.cdna.all.fa： Name Length EffectiveLength TPM NumReads ENST00000631435.1 12 12.000 0.000000 0.000 ENST00000415118.1 8 8.000 0.000000 0.000 ENST00000448914.1 13 13.000 0.000000 0.000 ENST00000434970.2 9 9.000 0.000000 0.000 others： Name Length EffectiveLength TPM NumReads NM_001080438.1 248956422 248956133.3 66.538614 2386248.409 NM_001080438.1 133797422 133797133.3 43.224636 833100.172 NM_001080438.1 135086622 135086333.3 73.906931 1438187.977

No matter which one I use, I get the same error!!! Where could the problem be? Therefore, I want to first confirm the format of the quant.sf file, which is very important. Secondly, please help me check if there is any issue with the merged.gtf.

The merged.gtf was generated from hg38.fa, using other fa files would cause errors. chr1 ncbiRefSeq.2021-12-08 transcript 33306766 33321098 . - . gene_id "A3GALT2"; transcript_id "NM_001080438.1"; gene_name "A3GALT2"; chr1 ncbiRefSeq.2021-12-08 exon 33306766 33307453 . - . gene_id "A3GALT2"; transcript_id "NM_001080438.1"; exon_number "5"; exon_id "NM_001080438.1.5"; gene_name "A3GALT2"; chr1 ncbiRefSeq.2021-12-08 stop_codon 33306766 33306768 . - 0 gene_id "A3GALT2"; transcript_id "NM_001080438.1"; exon_number "5"; exon_id "NM_001080438.1.5"; gene_name "A3GALT2"; chr1 ncbiRefSeq.2021-12-08 CDS 33306769 33307453 . - 1 gene_id "A3GALT2"; transcript_id "NM_001080438.1"; exon_number "5"; exon_id "NM_001080438.1.5"; gene_name "A3GALT2";

Thank you very much.

liuxc1995 commented 2 months ago

The error likely stems from an issue during the Salmon run. To resolve it, follow these steps:

First, ensure you have a set of target transcripts for quantification. Extract the transcript sequences from the merged.gtf file using gffread. gffread merged.gtf -g <genome.fa> -w transcripts.fa

Next, build the transcriptome index with Salmon. salmon index -t trasncripts.fa -i <index> -k 31

After building the index, you can quantify against this index using Salmon for accurate quantification.

I think this process should help avoid errors.

Badman1025 commented 2 months ago

In the response mentioned at https://github.com/YY-TMU/InPACT/issues/10, I used the hg38.fa file. I attempted the step with the command "gffread merged.gtf -g hg38.fa -w transcripts.fa", during which the following error occurred:

Warning: couldn't find fasta record for 'chr10_KN196480v1_fix'! Error: no genomic sequence available (check -g option!). Despite the error, I reviewed the output results. Here is the head of the transcripts.fa file:

NM_001005484.2 CDS=61-1041 CCCAGATCTCTTCAGTTTTTATGCCTCATTCTGTGAAAATTGCTGTAGTCTCTTCCAGTTATGAAGAAGG TAACTGCAGAGGCTATTTCCTGGAATGAATCAACGAGTGAAACGAATAACTCTATGGTGACTGAATTCAT ... I decided to use this erroneous result to try subsequent operations and build the index. Here is the head of the quant.sf file:

Name Length EffectiveLength TPM NumReads NM_001005484.2 2618 2326.487 0.000000 0.000 NM_001005221.2 939 647.817 0.000000 0.000 ... The name format of the other transcript output results is consistent:

Name Length EffectiveLength TPM NumReads NM_001080438.1 248956422 248956133.3 66.538614 2386248.409 NM_001080438.1 133797422 133797133.3 43.224636 833100.172 NM_001080438.1 135086622 135086333.3 73.906931 1438187.977 The result is still the same error:

Traceback (most recent call last): File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 103, in ipa_usage_dic = calculate_usage(transcript_tpm_path, File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in calculate_usage gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) File "/root/miniconda3/envs/InPACT/bin/InPACT_quantify", line 48, in gene_tpm_dic[gn] = sum([transcript_tpm_dic[transcript] for transcript in transcripts]) KeyError: 'NM_012199.5' Finally, please upload the final ipa_usage.txt format for reference.

Thank you.

liuxc1995 commented 2 months ago

We were unable to reproduce the error with gffread on our end. I recommend trying the command with different annotation files to verify the command. If the problem persists, feel free to send your complete code and files to my email (liuxc1995@tmu.edu.cn), and we’ll be happy to assist with troubleshooting. And the final ipa_usage.txt format is in README part.

YY-TMU / InPACT

Calculate usage of IPA sites #6