BIMSBbioinfo / pigx_scrnaseq

Pipeline for analysis of Dropseq single cell data
http://bioinformatics.mdc-berlin.de/pigx
10 stars 6 forks source link

Local pipeline install fails during convert_matrix_from_mtx_to_loom.py #52

Open t-carroll opened 4 years ago

t-carroll commented 4 years ago

Hi BIMSB team,

I've been working on getting your pipeline up and running as a non-root user on our group's compute cluster. Unfortunately, installing guix isn't an option in my case, but after some trial and error I was able to put together a conda environment with the listed dependencies that manages to get through the configure/make process.

However, when I go to run the pipeline on the test dataset, the pipeline fails with the following message:

./pigx-scrnaseq tests/sample_sheet.csv -s tests/settings.yaml

Error in job convert_matrix_from_mtx_to_loom while creating output file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom. RuleException: CalledProcessError in line 37 of /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py: Command '/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/bin/python /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py --sample_id WT_HEK_4h_br1 --input_dir /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out --gtf_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf --star_output_types_keys Gene GeneFull Velocyto Velocyto --star_output_types_vals Counts GeneFull Spliced Unspliced --output_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_hg19_UMI.matrix.loom --sample_sheet_file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/sample_sheet.csv --path_script /users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts &> /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_4h_br1.hg19.convert_matrix_from_mtx_to_loom.log' returned non-zero exit status 1. File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/Snake_Dropseq.py", line 769, in __rule_convert_matrix_from_mtx_to_loom File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/Accessory_Functions.py", line 37, in print_shell File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/concurrent/futures/thread.py", line 56, in run Exiting because a job execution failed. Look above for error message

I then checked the log referenced in the error message:

cat /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Log/WT_HEK_0h_br1.hg19.convert_matrix_from_mtx_to_loom.log

Parsing gene ids from gtf file /mnt/lustre/users/tcarroll/bin/git/pigx_scrnaseq/tests/out/Annotation/hg19/hg19.gtf Reading input files ... Counts Features ... Traceback (most recent call last): File "/users/tcarroll/sharedscratch/bin/git/pigx_scrnaseq/run/libexec/pigx_scrnaseq/scripts/convert_matrix_from_mtx_to_loom.py", line 102, in genes.columns = ['gene_id','gene_id2'] File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 5287, in setattr return object.setattr(self, name, value) File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.set File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/generic.py", line 661, in _set_axis self._data.set_axis(axis, labels) File "/users/tcarroll/sharedscratch/anaconda3/envs/pigx_scrnaseq/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 178, in set_axis f"Length mismatch: Expected axis has {old_len} elements, new " ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements

Looks like some issue with pandas, where there's 2 elements instead of the expected 3? I'm not as experienced with Python so having trouble figuring this out. Perhaps it's some dependency I missed during my manual local install? If you could provide any advice on what might be going on here or where to look for more troubleshooting info it would be most appreciated. I'm running in a CentOS Linux 7 (Core) environment, with tools and packages installed by Conda where possible and manually otherwise.

Thanks for the help! -Tom

t-carroll commented 4 years ago

Was able to troubleshoot this further. Looks like in convert_matrix_from_mtx_to_loom.py, there is a block of code:

    path_genes    = os.path.join(path_input,'features.tsv')
    genes         = pd.read_csv(path_genes, sep='\t', header=None)
    genes.columns = ['gene_id','gene_id2']
    genes         = genes.drop(columns = ['gene_id2'])

Where it is expecting 2 columns in the "features.tsv" file. However, all of the relevant features.tsv files generated by the test pigx_scrnaseq command in my environment generate 3 column outputs. Here's an example, but it's the same for all the various features.tsv generated for each sample: head -3 tests/out/Mapped/WT_HEK_4h_br1/hg19/WT_HEK_4h_br1_Solo.out/Gene/filtered/features.tsv

ENSG00000080573 ENSG00000080573 Gene Expression ENSG00000267650 ENSG00000267650 Gene Expression ENSG00000080511 ENSG00000080511 Gene Expression

I therefore modified the relevant code on my fork's devel branch to the following:

    path_genes    = os.path.join(path_input,'features.tsv')
    genes         = pd.read_csv(path_genes, sep='\t', header=None)
    genes.columns = ['gene_id','gene_id2','gene_expression']
    genes         = genes.drop(columns = ['gene_id2','gene_expression'])

And the test command now completes without error. Do the features.tsv files generated by your guix install on the test command only have two files? If so I imagine I have the wrong version of some dependency (maybe STAR? I have 2.7.5a).

In any case, the HTML report generated by my test command (attached) seemingly indicates I am having further issues with the pipeline, and may be generating incorrect results. For instance, the number of uniquely mapped UMIs is a tiny fraction of the total UMIs, and plots 3.0.2 and 3.0.3 are completely blank. Would it be possible to generate an example report for your test pipeline command (e.g. hg19.scRNA-Seq.report.html generated by your own installation) for users to compare against to insure proper installation?

Thanks for the help! hg19.scRNA-Seq.report_files.zip

frenkiboy commented 4 years ago

Dear Tom,

Thank you so much for reporting the message. I'll look into the error and sort it out.

Could you please just tell me which pandas version you are using?

Best,

Vedran

t-carroll commented 4 years ago

Hi Vedran,

Thanks for looking into it! It looks like I'm using pandas version 1.0.3:

pip freeze | grep "pandas"

pandas==1.0.3