Output in TSV format? - Githubissues

tinyheero commented 4 years ago

Hi there,

I've been able to get HISAT-genotype running on the tutorial data (https://daehwankimlab.github.io/hisat-genotype/tutorials/). The output report (assembly_graph-hla-NA12892_extracted_1_fq_gz-hla-extracted-1_fq.report) contains the results in a format that isn't readily parsed by a downstream program:

# VERSIONS:
# HISAT2 - 2.2.0

# HISAT-genotype - 1.3.0

# Database - Database hla derived from HISATgenotype DB version: NONE
# COMMAND:
/home/f.chan/hisatgenotype/hisatgenotype --base hla --locus-list A,B,C,DPA1,DPB1,DRB1,DQA1 -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
  A B C DPA1 DPB1 DRB1 DQA1

    hisat2 graph
      1496 reads and 769 pairs are aligned
        1 A*02:01:01:01 (count: 419)
        2 A*02:251 (count: 407)
        3 A*02:610:02 (count: 405)
        4 A*02:524:02 (count: 403)
        5 A*02:650 (count: 403)
        6 A*02:01:123 (count: 402)
        7 A*02:647 (count: 401)
        8 A*02:01:126 (count: 400)
        9 A*02:562 (count: 400)
        10 A*02:645 (count: 400)

        1 ranked A*02:01:01:01 (abundance: 51.95%)
        2 ranked A*11:01:01:01 (abundance: 48.05%)
      1331 reads and 676 pairs are aligned
        1 B*15:01:01:04 (count: 383)
        2 B*15:01:01:01 (count: 361)
        3 B*56:01:01:03 (count: 353)
        4 B*56:01:01:02 (count: 352)
        5 B*15:01:01:02N (count: 351)
        6 B*15:15 (count: 346)
        7 B*15:11:01 (count: 344)
        8 B*15:340 (count: 344)
        9 B*15:272N (count: 341)
        10 B*15:08:01 (count: 340)

I was just wondering if there was a way to get the results in a more standard format? For instance, in a TSV file?

Kind regards,

Fong

chbe-helix commented 4 years ago

Hi Fong,

I actually have a script to do just that as part of hisatgenotype_toolkit. There is currently a bug with the file format for HISATgenotype if you're on linux so please follow the steps I have outlined:

Run the following replacing the cd command with the directory you have hisatgenotype in.

cd $DIR/$WITH/$HISATGENOTYPE/
for file in */*; do sed -i 's/\r$//' $file; done

Then you can run the following command:

hisatgenotype_toolkit conc-results --help

You should see the help screen for the conc-results script. This will format the output as a TSV or CSV.

Let me know if this works for you.

Thanks, Chris

tinyheero commented 4 years ago

Thanks Chris.

Are you referring to the directory with HISAT-genotype output or where it is installed? When I try to run:

hisatgenotype_toolkit conc-results --help

I get this error message:

: No such file or directory

I get this error with or without the sed command you mentioned above (if you were referring to applying that to where it was installed). I have the following in my ~/.bashrc:

export PATH=~/hisatgenotype:~/hisatgenotype/hisat2:$PATH
export PYTHONPATH=~/hisatgenotype/hisatgenotype_modules:${PYTHONPATH}

So it is finding the hisatgenotype executables fine.

chbe-helix commented 4 years ago

Hi Fong,

Could you share the output of the following command:

cd ~/hisatgenotype
ls -alh
ls -alh hisatgenotype_tools/

I'd like to see what is going on with your tools directory as the toolkit script should be reading the options from there.

To your first question, where hisatgenotype is installed is where you should run that command. It looks like you are not having the carriage return error so it is something else.

Thanks, Chris

tinyheero commented 4 years ago

Hi Chris,

Here is the output:

$> ls -alh hisatgenotype_tools/
total 240K
drwxrwxr-x 2 f.chan f.chan 4.0K Jul 27 21:40 .
drwxrwxr-x 7 f.chan f.chan 4.0K Jul 27 21:14 ..
-rwxrwxr-x 1 f.chan f.chan  24K Jul 27 21:40 hisatgenotype_build_genome.py
-rwxrwxr-x 1 f.chan f.chan  13K Jul 27 21:40 hisatgenotype_call_variants.py
-rwxrwxr-x 1 f.chan f.chan 5.7K Jul 27 21:40 hisatgenotype_conc_results.py
-rwxrwxr-x 1 f.chan f.chan  29K Jul 27 21:40 hisatgenotype_convert_codis.py
-rwxrwxr-x 1 f.chan f.chan 6.8K Jul 27 21:40 hisatgenotype_extract_codis_data.py
-rwxrwxr-x 1 f.chan f.chan  42K Jul 27 21:40 hisatgenotype_extract_cyp_data.py
-rwxrwxr-x 1 f.chan f.chan  39K Jul 27 21:40 hisatgenotype_extract_RBG.py
-rwxrwxr-x 1 f.chan f.chan 5.1K Jul 27 21:40 hisatgenotype_extract_reads.py
-rwxrwxr-x 1 f.chan f.chan 4.2K Jul 27 21:40 hisatgenotype_extract_vars.py
-rwxrwxr-x 1 f.chan f.chan  20K Jul 27 21:40 hisatgenotype_legacy.py
-rwxrwxr-x 1 f.chan f.chan 7.6K Jul 27 21:40 hisatgenotype_locus.py
-rwxrwxr-x 1 f.chan f.chan  16K Jul 27 21:40 hisatgenotype_locus_samples.py

chbe-helix commented 4 years ago

Hi Fong,

I'm not sure why the wrapper isn't reading the proper script. You can run the script directly from that folder though using the following code:

python ~/hisatgenotype/hisatgenotype_tools/hisatgenotype_conc_results.py --help

That should allow you to run the conc_results script directly without the wrapper looking for it. If you see the help screen from that then you can use the --in-dir option with the directory you have the results output of hisatgenotype. Let me know if this works for you.

Thanks, Chris

tinyheero commented 4 years ago

Thanks Chris. It worked. I actually ended up passing the wrapper as a script to python. So that might highlight where the error is:

$> hisatgenotype_toolkit
: No such file or directory

$> python ~/hisatgenotype/hisatgenotype_toolkit --help
usage: hisatgenotype_toolkit [-h] [--see-script]
                             {build-genome,call-variants,conc-results,convert-codis,extract-RBG,extract-codis-data,extract-cyp-data,extract-reads,extract-vars,legacy,locus,locus-samples}
                             ...

HISAT-genotype Toolkit

optional arguments:
  -h, --help            show this help message and exit
  --see-script          Prints the exact script and options being run

Tools:
  {build-genome,call-variants,conc-results,convert-codis,extract-RBG,extract-codis-data,extract-cyp-data,extract-reads,extract-vars,legacy,locus,locus-samples}
                        HISAT-genotype tools and individual scripts
    build-genome        hisatgenotype_build_genome.py
    call-variants       hisatgenotype_call_variants.py
    conc-results        hisatgenotype_conc_results.py
    convert-codis       hisatgenotype_convert_codis.py
    extract-RBG         hisatgenotype_extract_RBG.py
    extract-codis-data  hisatgenotype_extract_codis_data.py
    extract-cyp-data    hisatgenotype_extract_cyp_data.py
    extract-reads       hisatgenotype_extract_reads.py
    extract-vars        hisatgenotype_extract_vars.py
    legacy              hisatgenotype_legacy.py
    locus               hisatgenotype_locus.py
    locus-samples       hisatgenotype_locus_samples.py

$> python ~/hisatgenotype/hisatgenotype_toolkit conc-results --in-dir hisatgenotype_out
File: hisatgenotype_out/assembly_graph-hla-NA12892_extracted_1_fq_gz-hla-extracted-1_fq.report
        Analysis - EM
                Gene: A
                        A*02:01:01:01 (abundance: 51.95%)

Thanks. I guess it's currently hardcoded to HG_report_results.csv (https://github.com/DaehwanKimLab/hisat-genotype/blob/master/hisatgenotype_tools/hisatgenotype_conc_results.py#L132). Are you open to pull requests to add a parameter to control that? I am planning to run this on numerous cases so it's best if they don't overwrite the same file :)

chbe-helix commented 4 years ago

Hi Fong,

I'm glad it's working for you. Sure thing! You're free to submit a pull request and I'll integrate it in. I'm happy to have any assistance from users in adding in useful parameters. If you have any other suggestions for any script, feel free to submit them as a feature request or a pull request and I'll be sure to work on adding it into the code base.

If this solves your issue, I'll go ahead and close this issue.

Thanks, Chris

DaehwanKimLab / hisat-genotype

Output in TSV format? #16