Gabaldonlab / jloh

A tool to extract LOH blocks from VCF, BAM and FASTA data
http://jloh.readthedocs.io
GNU General Public License v3.0
21 stars 2 forks source link

Errors running test dataset in docker #10

Closed argrs-mtjs closed 9 months ago

argrs-mtjs commented 9 months ago

Hi! I'm trying to run jloh using docker, but I am running into some errors when following instructions (https://jloh.readthedocs.io/en/latest/usage/run_test_data.html) for the test dataset provided.

For the "jloh stats" command, I get:

[Wed Jan 24 17:05:59 2024] Reading SNPs
[Wed Jan 24 17:05:59 2024] found 19 het SNPs and 114 homo SNPs
[Wed Jan 24 17:05:59 2024] Reading chrom lengths from VCF header
[Wed Jan 24 17:05:59 2024] Read 1 chromosome names and their lengths
[Wed Jan 24 17:05:59 2024] Calculating heterozygous SNP densities
Traceback (most recent call last):
  File "/root/src/jloh/src/stats", line 331, in <module>
    main(args)
  File "/root/src/jloh/src/stats", line 263, in main
    Het, Het_quant = calculate_chrom_snp_densities(Hetero_lines, Chrom_lengths, args)
  File "/root/src/jloh/src/stats", line 244, in calculate_chrom_snp_densities
    df_quant = df.quantile(q=[0.05, 0.10, 0.15, 0.50, 0.85, 0.90, 0.95])
  File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 11834, in quantile
    res = data._mgr.quantile(qs=q, interpolation=interpolation)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1507, in quantile
    blocks = [
  File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 1508, in <listcomp>
    blk.quantile(qs=qs, interpolation=interpolation) for blk in self.blocks
  File "/usr/local/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 1587, in quantile
    result = quantile_compat(self.values, np.asarray(qs._values), interpolation)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/array_algos/quantile.py", line 39, in quantile_compat
    return quantile_with_mask(values, mask, fill_value, qs, interpolation)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/array_algos/quantile.py", line 97, in quantile_with_mask
    result = _nanpercentile(
  File "/usr/local/lib/python3.9/site-packages/pandas/core/array_algos/quantile.py", line 218, in _nanpercentile
    return np.percentile(
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4283, in percentile
    return _quantile_unchecked(
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4555, in _quantile_unchecked
    return _ureduce(a,
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 3823, in _ureduce
    r = func(a, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4721, in _quantile_ureduce_func
    result = _quantile(arr,
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4840, in _quantile
    result = _lerp(previous,
  File "/usr/local/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4655, in _lerp
    diff_b_a = subtract(b, a)
TypeError: unsupported operand type(s) for -: 'str' and 'str'

For the "jloh extract", everything runs as expected.

And for the "jloh plot", I get:

INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
[Wed Jan 24 17:07:47 2024] Reading input information
[Wed Jan 24 17:07:47 2024] Quantizing heterozygosity in windows of 10000 bp
Parsing rows: 100.0%
[Wed Jan 24 17:07:47 2024] Sorting by genome coordinate
[Wed Jan 24 17:07:47 2024] Quantizing intervals in windows of 10000 bp
Parsing rows: 100.0%
[Wed Jan 24 17:07:47 2024] Sorting by genome coordinate
[Wed Jan 24 17:07:48 2024] Writing table to output
[Wed Jan 24 17:07:48 2024] Plotting

Plot command that was run:
Rscript /root/src/jloh/src/scripts/loh-bin-plots_one-ref.Rscript jloh_out/plot.LOH_rate.tsv jloh_out/plots by_chromosome /input/jloh.LOH_blocks.tsv 0.35,2000,750,250 REF,ALT \#F7C35C,\#EF6F6C,\#64B6AC,\#ffffff no plot max

Error in library(reshape2) : there is no package called ‘reshape2’
Execution halted

Can you please advise? Thanks in advance

MatteoSchiavinato commented 9 months ago

Would you be willing to share the dataset, since it's only a few SNPs, so we can inquire into the error? The first does not seem related to docker.

argrs-mtjs commented 9 months ago

It is the out.ff.vcf that you provide in https://github.com/Gabaldonlab/jloh/tree/master/test_data. I was just testing if everything was ok before proceeding to my dataset.

I suspected that the first error could be unrelated to docker...

MatteoSchiavinato commented 9 months ago

We'll look into that. Perhaps try with another VCF you have lying around, see if the same thing happens.

argrs-mtjs commented 9 months ago

I tested with one of my VCF and got the same error. I will wait for further insights from you

MatteoSchiavinato commented 9 months ago

Could you tell me how many variants are in your VCF, and if you have any chromosome with no variants?

argrs-mtjs commented 9 months ago

Sure! I have 87129 SNPs in my VCF and all the chromosomes have variants.

Besides giving the error I reported above, jloh seems to be able to read the VCF almost as it should. I noticed a small discrepancy in the number of het SNPs (it should be 81905), whereas the number of homo SNPs is correct.

[Thu Jan 25 15:35:44 2024] Reading SNPs
[Thu Jan 25 15:35:44 2024] found 81840 het SNPs and 5224 homo SNPs
[Thu Jan 25 15:35:44 2024] Reading chrom lengths from VCF header
[Thu Jan 25 15:35:44 2024] Read 9 chromosome names and their lengths
[Thu Jan 25 15:35:44 2024] Calculating heterozygous SNP densities
MatteoSchiavinato commented 9 months ago

The discrepancy could be due to the default values of --min-af and --max-af. As for the rest, I'm going to keep investigating.

MatteoSchiavinato commented 9 months ago

The above commit should fix the issue wtih reshape2, which was derived by something on their side.

MatteoSchiavinato commented 9 months ago

Could you give me your exact command to launch the test dataset? Might help.

argrs-mtjs commented 9 months ago

Thank you for the commit. It fixed the first problem with the "jloh stats". The "jloh plot" now gives an error with the package "hash":

   import pandas as pd
INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
[Tue Jan 30 15:18:14 2024] Reading input information
[Tue Jan 30 15:18:14 2024] Quantizing heterozygosity in windows of 10000 bp
Parsing rows: 100.0%
[Tue Jan 30 15:18:14 2024] Sorting by genome coordinate
[Tue Jan 30 15:18:14 2024] Quantizing intervals in windows of 10000 bp
Parsing rows: 100.0%
[Tue Jan 30 15:18:14 2024] Sorting by genome coordinate
[Tue Jan 30 15:18:14 2024] Writing table to output
[Tue Jan 30 15:18:14 2024] Plotting

Plot command that was run:
Rscript /root/src/jloh/src/scripts/loh-bin-plots_one-ref.Rscript jloh_out/plot.LOH_rate.tsv jloh_out/plots by_chromosome /input/jloh.LOH_blocks.tsv 0.35,2000,750,250 REF,ALT \#F7C35C,\#EF6F6C,\#64B6AC,\#ffffff no plot max

Error in library(hash) : there is no package called ‘hash’
Execution halted

The exact command for the plot is: docker run -v $PWD:/input -t -i --rm cgenomics/jloh jloh plot --one-ref --loh /input/jloh.LOH_blocks.tsv --het /input/jloh.exp.het_blocks.bed --contrast max

MatteoSchiavinato commented 9 months ago

Hi @argrs-mtjs , we have issued a new release that fixes the problem you brought up: https://github.com/Gabaldonlab/jloh/commit/cd40daed6e5e7308e67771b50b0af9ac3939217e.

You will have to rebuild the docker image.

Thanks for pointing it out, we hope that now it works. We have tested it and it works fine for us.