DaehwanKimLab / hisat-genotype

GNU General Public License v3.0
25 stars 15 forks source link

Convert the result file to table #3

Closed memumab closed 4 years ago

memumab commented 4 years ago

Hello, I used hisat-genotype for HLA typing of many loci on a big sampleset. Do you have a suggestion what would be the easiest procedure to convert the .report file in a table with all the Loci (e.g. HLA-A1, A2, B1, B2, C1, C2 and so on) in the columns and all my samples in the rows without copy-pasting everything by hand?

Kind regards, Matze

chbe-helix commented 4 years ago

Hi Matze,

This is a feature I'm working on in the newest version of hisatgenotype. I'd be happy to send you the script ahead of the release. If this works for you, would you be willing to provide feedback on the formatting, general usability, and/or anything else you think would help make it a better tool?

Thanks, Chris

memumab commented 4 years ago

Dear Chris, nice to hear that you already thought about such a feature! Of course, I would be happy to test it and give you a feedback on it. How would you like to transfer the script, and would this work with the already produced result files or should I rerun the typing?

Kind regards and thanks in advance, Matze

chbe-helix commented 4 years ago

Fabulous! Let me collect the script(s) and send you a collected version. I'll let you know how you can get it later today.

The script works on the .report files so you shouldn't need to rerun anything.

Thanks, Chris

chbe-helix commented 4 years ago

Hi Matze,

A beta version of the script has been added to hisatgenotype v1.1.3. You can find it by updating your directory with:

git pull

Then looking in __hisatgenotype_scripts folder for file hisatgenotype_conc_results-beta.py__.

You'll likely have to run it as follows:

python $PATH_TO_DIR/hisatgenotype_conc_results-beta.py --in-dir $PATH_TO_REPORTS --csv > $OUT_FILE

After it runs you will have two files: 1) HG_report_results.csv (this is what I think you'll want) and 2) $OUT_FILE (A linear report of the reports). You'll notice an allele splitting category in the files. This is a new feature method for parsing out confidence calls for allele levels.

For example:

You are welcome to use this feature if you so desire. I'd love to hear your feed back on this script.

Thanks, Chris

memumab commented 4 years ago

Dear Chris, thank you very much for providing the script that fast. It worked smoothly and I think it is a super useful extension for hisat-genotype! I like that it recapitulates the result file and then chooses the most probable alleles in a second step. Although this process seems to be working nicely, it often still reports three or more options, one of which is marked "partial" as you described above. Example:

Analysis EM: Gene: DRB1 DRB116:02:01:01 (abundance: 52.22%) DRB109:01:02 (abundance: 37.45%) *DRB116:02:01:02** (abundance: 10.33%)

Analysis Allele splitting: Gene: DRB1 DRB116:02:01 - Partial (score: 0.6255) DRB116:02:01:01 (score: 0.5222) *DRB109:01:02** (score: 0.3745)

So I guess the partial DRB116:02:01 already contains the DRB116:02:01:01, but the latter is still reported. Wouldn't it make sense to drop the *DRB116:02:01:01** in this case, so that the scores are adding up to 1.0 again?

Another suggestion, at least concerning my goals with the HLA-Typing, would be to cut it down to a four digit allele. So the genotype of the example would simply be DRB116:02 DRB109:01 and if it's now DRB116:02:01:01, DRB116:02:01:02 or another *DRB116:02** (and in most cases there is this uncertainty) will probably not be further resolvable, but will probably also not matter for most users. For my part I'm only interested in the allele down to the protein sequence (four digit) and whether someone is heterozygous or homozygous for this. This would reduce the need for manual curation of the table even more ;)

Thanks again for helping me out and continue the good work! Kind regards, Matze

chbe-helix commented 4 years ago

Hi Matze,

Thank you for the great suggestions! I'll add them to my list of improvements to make to the script. I especially appreciate the suggestion for cutting the alleles down to different levels. That will be a very useful and doable feature in the upcoming release. For the time being, I'll mark this issue as closed. Watch for HISATgenotype version 1.2 to come out in or before Q2 2020.

Thanks! -Chris