DaehwanKimLab / hisat-genotype

GNU General Public License v3.0
25 stars 15 forks source link

Trimming the final two fields in the HLA-alleles #42

Open Tijs-dot opened 3 years ago

Tijs-dot commented 3 years ago

Hi Chris,

I was wondering if there is a way to trim the fields that distinguish between different HLA calls for the same allele, for the abundance values when using the -t option from hisatgenotype_toolkit. I wanted to convert the .report output from HISAT-genotype into a csv format, while trimming the last two fields (by using -t 2), so I would be able to group all alleles that only differ from one another in the final two fields of the nomenclature. However, this only seems to trim the allele count fields, not the ones in the abundance part of the report output. So for example in the output below, the 5 alleles ranked at the bottom would remain untrimmed. Actually four of them are the HLA-A*02:01 allele, and I would like them to be counted together as that allele.

4392 reads and 2212 pairs are aligned 1 A02:01:01:01 (count: 1550) 2 A02:01:01:05 (count: 1550) 3 A02:01:01:06 (count: 1550) 4 A02:01:01:07 (count: 1550) 5 A02:01:01:08 (count: 1548) 6 A02:01:126 (count: 1530) 7 A02:648 (count: 1525) 8 A02:30:01 (count: 1523) 9 A02:562 (count: 1523) 10 A02:498 (count: 1518)

1 ranked A01:01:01:01 (abundance: 30.96%) 2 ranked A02:01:01:01 (abundance: 17.26%) 3 ranked A02:01:01:05 (abundance: 17.26%) 4 ranked A02:01:01:06 (abundance: 17.26%) 5 ranked A*02:01:01:07 (abundance: 17.26%)

Does this question make sense to you?

Thanks in advance, Tijs

chbe-helix commented 3 years ago

Hi Tijs,

The parse-results should do just what you are asking. The script may have broken when you changed the word "abundance" in issue 40. I believe there is another error going on in that issue that had to do with an error in the number of words in a line. It is expecting 3 and finding less. I suspect if you change the script back to it's original state then that part will be fixed. Then it is just figuring out which file has a possible error.

Thanks, Chris

Tijs-dot commented 3 years ago

Hi Chris,

When I changed the word abundance, I only did so for the one in the command line. So I changed "keep-low-abundance-alleles" in the .report file but I left all other "abundance" terms unchanged. Then I trim it with the toolkit, but this only works on the counts. I think that maybe was unclear. Any suggestions on what else could be wrong?

Thanks Tijs

chbe-helix commented 3 years ago

Hi Tijs,

I understand now sorry for my misunderstanding. I'm running a few analyses for a collaborator right now. Let me test the script on their results and see what transpires. I'll update you as soon as that is done. In the meantime my first idea is to try to run the script without the -t option

Thanks, Chris

chbe-helix commented 3 years ago

Hi Tijs,

I just ran the results from my collaborator and the parse-results script is collecting the abundance rather than the count. My best suggestion right now is to restore hisatgenotype using git stash in the hisatgenotype code directory and try to run the parse-results again. It will probably have the line length error again and if that is the case I think you'll need to sift through your files to find the culprit. Sorry I can be of more help.

Thanks, Chris