FRED-2 / OptiType

Precision HLA typing from next-generation sequencing data
BSD 3-Clause "New" or "Revised" License

hla translation #79

Open ktroule opened 6 years ago

ktroule commented 6 years ago

Hi, for some downstream analysis I need to identify the FASTA sequence used for the HLA allele that was called. For instance, if OptiType called A*01:01, I would need to know its FASTA entry, which would be >HLA00001 HLA-A*01:01:01:01.

But A*01:01 matches many entries. I would be happy if there were a way to go from A*01:01 to HLA00001.

Is there a way to obtain this?

Thanks

andras86 commented 6 years ago

Hi, you can tweak the code around line 421 in OptiTypePipeline.py, at result_4digit = result.applymap(get_types). result contains the HLA00001-like identifiers, so you can either write that result DataFrame to a custom file or simply set result_4digit = result.
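If you would rather not touch the pipeline, the lookup can also be done afterwards from the reference FASTA headers. The following is only a minimal sketch, not part of OptiType: it assumes headers shaped like the example quoted in the question (>HLA00001 HLA-A*01:01:01:01), and the function names and the HLA_FAKE_* identifiers are made up for illustration.

```python
from collections import defaultdict


def index_fasta_headers(lines):
    """Map identifier -> full allele name from FASTA header lines.

    Assumes headers look like '>HLA00001 HLA-A*01:01:01:01' (the form
    quoted in the question); other layouts would need a different parser.
    """
    id_to_allele = {}
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if len(fields) < 2:
            continue  # header without an allele name
        identifier, allele = fields[0], fields[1]
        if allele.startswith("HLA-"):
            allele = allele[len("HLA-"):]
        id_to_allele[identifier] = allele
    return id_to_allele


def ids_by_four_digit(id_to_allele):
    """Group identifiers by the first two colon-separated fields."""
    grouped = defaultdict(list)
    for identifier, allele in id_to_allele.items():
        four_digit = ":".join(allele.split(":")[:2])
        grouped[four_digit].append(identifier)
    return dict(grouped)


# Illustrative input: only the first header comes from this thread,
# the HLA_FAKE_* entries are invented placeholders.
example = [
    ">HLA00001 HLA-A*01:01:01:01",
    "ATGGCC",
    ">HLA_FAKE_1 HLA-A*01:01:02",
    ">HLA_FAKE_2 HLA-A*02:01:01:01",
]

index = index_fasta_headers(example)
groups = ids_by_four_digit(index)
print(groups["A*01:01"])
```

A 4-digit name generally maps to several identifiers, which is why the grouping returns a list. Reading the identifier straight from result, as described above, avoids that ambiguity.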

ktroule commented 6 years ago

Great. I'll explore it.

Rashesh7 commented 6 years ago

Hi @andras86, I had a similar question. Did you have a reason to report only the 4 digits, like A*01:01, instead of the whole allele name, A*01:01:01:01?

At http://hla.alleles.org/, alleles with the same exon 2 and exon 3 sequences are combined into G groups, whereas alleles whose exon 2 and exon 3 code for the same protein (even with some differences in the genomic sequence) are combined into a P group.

Would the 4-digit output of OptiType be similar to a P group or a G group?

Your help and suggestions are highly appreciated. Thanks.

andras86 commented 6 years ago

Hi Rashesh,

The main reason is that there was no way to benchmark the accuracy of the calls beyond the 4-digit level, so we avoided putting "significant digits" in the csv that we couldn't vouch for. As for whether the 4-digit csv output is more akin to the P or G groups: no such group look-ups are performed; the names are just trimmed versions of the full name. Of course, for most alleles this will match their P group, but we aren't doing it actively.
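To make "trimmed versions of the full name" concrete, here is a rough sketch of that kind of trimming. It is not the actual get_types code from OptiTypePipeline.py; trim_allele is a hypothetical helper that simply keeps the first colon-separated fields of an allele name.

```python
def trim_allele(name, fields=2):
    """Keep the first `fields` colon-separated fields of an allele name.

    Pure string trimming: no P-group or G-group look-up is involved.
    Hypothetical helper, not OptiType's own implementation.
    """
    return ":".join(name.split(":")[:fields])


print(trim_allele("A*01:01:01:01"))
print(trim_allele("A*01:01:01:01", fields=3))
```

Two alleles that differ in the first two fields but share a P group would still come out as different names here, which is the sense in which the match with P groups is incidental.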

Rashesh7 commented 6 years ago

Hi Andras, Thank you for the quick reply. That makes sense. If we need the full allele name I can edit OptiTypePipeline.py as you mentioned above.

Last couple of questions: 1) Is there a plan to update the backend HLA database? The latest release (3.33.0) came out recently. 2) Is there a plan to add somatic variant calling?

I apologize for the many questions. It's just that I recently benchmarked a few tools and found OptiType to be more robust across technologies, with good accuracy. It would be great to have a somatic calling capability.

Thank you!

andras86 commented 6 years ago

Hi Rashesh, I know the next major release (handling Class II, somatic calling and a fresh database) is becoming a running joke by now, but it's coming. I'm glad to hear that you found OptiType robust across technologies!