RabadanLab / arcasHLA

Fast and accurate in silico inference of HLA genotypes from RNA-seq
GNU General Public License v3.0
113 stars 49 forks source link

gene without cDNA sequence causes error when running quant.py #102

Open adeschen opened 1 year ago

adeschen commented 1 year ago

Hi,

The quantification step is crashing for some of our samples. The error is always the same:

Traceback (most recent call last):
  File "/grid/tuveson/home/software_elzar/bin/miniconda/miniconda3_v4.12.0/miniconda3/envs/arcasHLA-v0.5.0_db3.47.0/share/arcas-hla-0.5.0-0/scripts/quant.py", line 250, in <module>
    allele_results[allele_id[:-1]]['allele' + allele_id[-1]] = allele

There is an easy fix to this problem. The current quant.py core (line 250):

    for allele_id, allele in genotype.items():
        allele_results[allele_id[:-1]]['allele' + allele_id[-1]] = allele

can be modified to check if the key exists and if not, create one:

   for allele_id, allele in genotype.items():
    if not allele_id[:-1] in allele_results:
        allele_results[allele_id[:-1]] = defaultdict(float)
    allele_results[allele_id[:-1]]['allele' + allele_id[-1]] = allele

That modification removes the error but don't remove the cause of the problem. The quant.py script uses the .p file to populate the allele_results dictionary. The .p contains both the genes and the genotype dictionaries. The genes dictionary is used by the quant.py script to create the keys of the allele_results dictionary. However, it appears that the genes dictionary is something missing genes that have alleles in the genotype dictionary.

It seams the cause of this problem is related to the lack of cDNA sequence for some allele as managed in the customize.py script. The script generates the .p file that is going to be used by the quant.py script. Both the genes and genotype dictionaries are saved in the .p file. However, the genes dictionary won't contains the genes that are missing cDNA sequences for both alleles (line 137 of customize.py) as the condition for the loop is not respected:

        for seq in sequences:
            hla_idx[allele_id].append(str(idx))
            genes[gene].append(allele_id)

Can you address this issue.

Best regards Astrid