broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License

Bugfixes #62

Open dampierch opened 3 years ago

dampierch commented 3 years ago

Intro

Thank you for sharing your code. I am using it to analyze some of my own data. In particular, I am using the qtl scripts. I encountered a few bugs while running eqtl_prepare_expression.py and fixed them in a local branch. I thought I should share them with you in case the fixes are helpful for others using your code.

Fixes

  1. Added pyqtl to Dockerfile per Issue #61
  2. Added some functions to make numeric indices in the sample_participant_lookup run smoothly.
  3. Added an option to ignore ENSEMBL gene version numbers in the merging part of the BED preparation.
    • One should never have different ENSEMBL gene version numbers if the gene model is the same in all processing steps. Unfortunately, sometimes the gene model used in a particular step is unknown due to insufficient documentation from a collaborator or commercial service.
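
As an illustration of fix 3, here is a minimal sketch of merging on version-less gene IDs. It is not the code from the branch: the helper `strip_gene_version`, the table names, and the toy values are all hypothetical. IDs with a non-numeric suffix (for example GENCODE's `_PAR_Y` entries) are deliberately left unchanged.

```python
import pandas as pd

def strip_gene_version(gene_id: str) -> str:
    """Drop a trailing '.<digits>' version from an ENSEMBL gene ID, if present."""
    base, sep, version = gene_id.rpartition('.')
    return base if sep and version.isdigit() else gene_id

# Hypothetical toy tables whose gene versions disagree between processing steps.
expression_df = pd.DataFrame(
    {'tpm': [1.5, 0.0]},
    index=['ENSG00000223972.5', 'ENSG00000227232.5'],
)
annotation_df = pd.DataFrame(
    {'chr': ['chr1', 'chr1'], 'tss': [11869, 29570]},
    index=['ENSG00000223972.4', 'ENSG00000227232.6'],
)

# Normalize both indices to version-less IDs before joining.
expression_df.index = expression_df.index.map(strip_gene_version)
annotation_df.index = annotation_df.index.map(strip_gene_version)

# An inner join on the index now matches genes regardless of version.
merged_df = annotation_df.join(expression_df, how='inner')
print(merged_df)
```

Without the normalization step the inner join on these two tables would be empty, because none of the versioned IDs match exactly.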

ActioTom commented 2 years ago

Also, I believe the --convert-tpm option is broken because of a bug around line 100 in eqtl_prepare_expression.py:

    if args.convert_tpm:
        print('  * Converting to TPM', flush=True)
        tpm_df = tpm_df / tpm_df.sum(0) * 1e6

This fails because the first column of tpm_df is Name, which contains gene IDs rather than numeric data, so the column-wise sum and division are applied to strings.
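
One possible workaround, sketched under the assumption that the table has the layout described above (a `Name` column followed by numeric sample columns), is to move `Name` into the index before rescaling. The toy data and the name `counts_df` are hypothetical, and the arithmetic simply mirrors the quoted line.

```python
import pandas as pd

# Hypothetical toy table: the first column holds gene IDs, the rest are samples.
counts_df = pd.DataFrame({
    'Name': ['geneA', 'geneB', 'geneC'],
    'sample1': [10.0, 30.0, 60.0],
    'sample2': [5.0, 5.0, 10.0],
})

# Move the identifier column into the index so only numeric columns remain.
tpm_df = counts_df.set_index('Name')

# Same rescaling as the quoted line: each sample (column) now sums to one million.
tpm_df = tpm_df / tpm_df.sum(0) * 1e6
print(tpm_df)
```

If later steps expect `Name` as a regular column again, `tpm_df.reset_index()` restores it.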