ANHIG / IMGTHLA

Github for files currently published in the IPD-IMGT/HLA FTP Directory hosted at the European Bioinformatics Institute
http://www.ebi.ac.uk/ipd/imgt/hla/

Differences between hla_nuc and hla_gen #382

Closed ahmadalajami closed 3 months ago

ahmadalajami commented 3 months ago

Hi there,

I am trying to quantify a particular allele in a scRNA-seq dataset. I found the allele A*03:04:01 in Allelelist.txt and hla_nuc.fasta, but not in hla_gen.fasta. Which sequence do you suggest using when quantifying:

  1. Alleles that exist in both the _nuc and _gen fasta files?
  2. Alleles that exist only in the _nuc fasta file?

Cheers, Ahmad

colinhercus commented 3 months ago

I use this awk script to find sequences that exist in hla_nuc.fasta but not in hla.fasta:

awk ' /^[^>]/ {if(p==1) print; next} FNR == NR { G[$1] = 1; next } $1 in G { p=0; next} {print substr($1,2); p = 1; G[$1]=1 }' hla.fasta hla_nuc.fasta >HLA.exonOnly_nuc.id
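For readers who find the one-liner hard to follow, the same idea can be sketched in Python. This is only a rough equivalent, not the script above: the function names (read_fasta, accession, nuc_only) are made up for illustration, and it assumes that the first whitespace-separated field of each FASTA header is the accession shared between the two files. Unlike the awk version, it keeps the full header rather than just the ID.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header: str) -> str:
    """Return the first whitespace-separated field of a header.

    Assumption: this field identifies the same allele in both files.
    """
    parts = header.split()
    return parts[0] if parts else ""


def nuc_only(
    gen_lines: Iterable[str], nuc_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return records present in the second FASTA but absent from the first."""
    gen_ids: Set[str] = {accession(h) for h, _ in read_fasta(gen_lines)}
    seen: Set[str] = set()
    result: List[Tuple[str, str]] = []
    for header, seq in read_fasta(nuc_lines):
        acc = accession(header)
        # Skip records already in the first file, and repeated accessions.
        if acc in gen_ids or acc in seen:
            continue
        seen.add(acc)
        result.append((header, seq))
    return result
```

Both arguments can be open file handles, so calling nuc_only(open("hla_gen.fasta"), open("hla_nuc.fasta")) should list the exon-only entries, provided the header assumption holds for your copy of the files.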

colinhercus commented 3 months ago

For RNA-seq you should just use hla_nuc.fasta.

dominicbarkerAN commented 3 months ago

Hello Ahmad, thank you for your query. As discussed elsewhere, including in our FAQs, the number of alleles in hla_nuc.fasta does differ from the number in hla_gen.fasta. This is because partial sequences that cover only exons are included in hla_nuc.fasta but not in hla_gen.fasta. With regard to how these files should be used with something like scRNA-seq data, you will need to seek support from the source of the dataset or from the provider of the sequencing or software you are using.

Best,

Dominic