epigen / LIQUORICE

A tool to detect tissue- and cancer- specific epigenetic signatures in WGS data of liquid biopsies
https://liquorice.readthedocs.io
GNU General Public License v3.0
8 stars 5 forks source link

`assert len(x)==len(y) AssertionError` #10

Closed jgarces02 closed 1 month ago

jgarces02 commented 1 month ago

Hi, thanks a lot for this tool!

I'm having some problems I cannot identify, please:

This is my code...

$ bedtools genomecov -bg -ibam 160023_IGO_15698_6.v2.sorted.markdups.bam > tmp.bedgraph
$ grep -v -e KI -e GL0 tmp.bedgraph > tmp2.bedgraph #remove patches, problems in next steps
$ bedGraphToBigWig tmp2.bedgraph http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes tmp2.bw

$ LIQUORICE \
    --refgenome_fasta ~/references/GRCh38/liquorice/hg38.p12.fa \
    --bedpathlist ~/references/GRCh38/liquorice/universal_DHSs.bed \
    --bamfile 160023_IGO_15698_6.v2.sorted.markdups.bam \
        --mappability_bigwig tmp2.bw \
        --blacklist "hg38" --n_cpus 2 --samplename tmp_liq

... and the error:

INFO: 2024/05/30 10:29:16

##############################################################
############  Running LIQUORICE on sample tmp_liq  ###########
##############################################################

INFO: 2024/05/30 10:29:16        Sampling fragment sizes, using 2 cores ...
INFO: 2024/05/30 10:29:35        Overview of global fragment lengths: median=173; percentiles: 10th=135, 25th=158, 75th=215, 90th=384
INFO: 2024/05/30 10:29:35        Average read length is 99.
INFO: 2024/05/30 10:29:37        Calculating coverage at 10000 regions regularily distributed across the genome to determine overall coverage ...
INFO: 2024/05/30 10:30:07        Mean coverage at the 10000 regions of length 1 is 4.968016368727116.
INFO: 2024/05/30 10:30:07        ###### Starting analysis for universal_DHSs ######
INFO: 2024/05/30 10:30:07        Splitting input bedfile '/home/garcesj/references/GRCh38/liquorice/universal_DHSs.bed' into bins ...
INFO: 2024/05/30 10:30:08        Wrote bed file with bin coordinates to 'bins.bed'.
INFO: 2024/05/30 10:30:08        Calculating coverage at every bin defined in the bed file 'bins.bed', using 2 cores - this may take a while ...
INFO: 2024/05/30 10:30:52        Retrieving genomic sequences for every bin ...
INFO: 2024/05/30 10:30:53        Retrieving mappability information for every bin ...
INFO: 2024/05/30 10:30:57        Calculating bias factors for bins of various sizes, using up to 2 cores  - this may take a while. (The following 'UserWarning's from modin can be ignored.)
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
INFO: 2024/05/30 10:34:48        Skipping the 5 central bins (numbers [40, 41, 42, 43, 44]) for training of the bias-model.
WARNING: 2024/05/30 10:34:48     Training dataset contained 39185 rows with empty values, these were discarded.
WARNING: 2024/05/30 10:34:48     Dataset that should be corrected contained 41640 rows with empty values, these were discarded.
INFO: 2024/05/30 10:34:48        Using CVs to train biasmodel and predict...
INFO: 2024/05/30 10:34:48        Using features: ['forward mappability', 'reverse mappability', 'max mappability', 'AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATG', 'ATC', 'AGA', 'AGT', 'AGG', 'AGC', 'ACA', 'ACG', 'ACC', 'TAA', 'TAG', 'TAC', 'TTG', 'TTC', 'TGA', 'TGG', 'TGC', 'TCG', 'TCC', 'GAG', 'GAC', 'GTG', 'GGG', 'GGC', 'GCG', 'CAG', 'CGG', 'AA', 'AT', 'AG', 'AC', 'TA', 'TG', 'TC', 'GG', 'GC', 'CG', 'GC content']
INFO: 2024/05/30 10:34:48        Training-set R^2 for the first cross validation fold: 0.9525214490296251
INFO: 2024/05/30 10:34:48        Out-of-sample R^2 for the first cross validation fold: 0.017573519674813243
INFO: 2024/05/30 10:34:48        Training-set R^2 for the second cross validation fold: 0.9224141346551441
INFO: 2024/05/30 10:34:48        Out-of-sample R^2 for the second cross validation fold: -0.21582653491591075
INFO: 2024/05/30 10:34:48        Cross-validated R^2 overall: -0.07506725785757418
INFO: 2024/05/30 10:34:48        Cross-validated MSE overall: 0.17814457396358918
INFO: 2024/05/30 10:34:49        Fitting gaussian functions and intercept to coverage values ...
Traceback (most recent call last):
  File "/home/garcesj/software/miniconda3/envs/cfDNA.liquorice/bin/LIQUORICE", line 10, in <module>
    sys.exit(main())
  File "/home/garcesj/software/miniconda3/envs/cfDNA.liquorice/lib/python3.8/site-packages/liquorice/LIQUORICE.py", line 469, in main
    Workflows.run_liquorice_train_biasmodel_on_same_regions(
  File "/home/garcesj/software/miniconda3/envs/cfDNA.liquorice/lib/python3.8/site-packages/liquorice/utils/Workflows.py", line 687, in run_liquorice_train_biasmodel_on_same_regions
    fit_df=fit_gaussians_obj.fit_gaussian_models(**sigma_contraints)
  File "/home/garcesj/software/miniconda3/envs/cfDNA.liquorice/lib/python3.8/site-packages/liquorice/utils/FitGaussians.py", line 82, in fit_gaussian_models
    assert len(x)==len(y)
AssertionError
jgarces02 commented 1 month ago

Sorry, it was a problem with the bigWig generation. Already solved using another approach:

$ bamCoverage --bam 160023_IGO_15698_6.v2.sorted.markdups.bam -o tmp3.bw