fiberseq / fibertools-rs

Tools for fiberseq data written in Rust.
https://fiberseq.github.io/fibertools/fibertools.html

FIRE on targeted seq data: fiber-locations-shuffled.bed.gz is created empty #37

Closed: Strausyatina closed this issue 11 months ago

Strausyatina commented 11 months ago

Hi Mitchell! We've tried to run FIRE on targeted sequencing data, and the pipeline fails with "polars.exceptions.NoDataError: empty CSV" because fiber-locations-shuffled.bed.gz is created empty.

A BED file containing the complement of the targeted regions was used as the exclusion set in filtered_and_shuffled_fiber_locations_chromosome (the file is listed below, followed by a sketch of how it can be generated).

What could be the issue with our usage of FIRE? Is it suitable for such a task?
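To confirm the failure mode, here is a quick check of ours (not part of the workflow) that counts the lines in the shuffled BED the failing rule reads; the path is taken from the log further down:

import gzip

# Path taken from the failing fdr_table rule in the log below.
path = "results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz"

with gzip.open(path, "rt") as handle:
    n_lines = sum(1 for _ in handle)

# 0 lines here is exactly what makes the pl.read_csv() call in
# fire-null-distribution.py raise NoDataError: empty CSV.
print(f"{path}: {n_lines} lines")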

Config YAML:

ref: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa
ref_name: hg38
n_chunks: 1 # split bam file across x chunks
max_t: 4 # use X threads per chunk
manifest: config/config_targeted_project.tbl # table with samples to process

keep_chromosomes: chr4 # only keep chrs matching this regex.
keep_chromosomes: chr7
keep_chromosomes: chr20
## Force a read coverage instead of calculating it genome wide from the bam file.
## This can be useful if only a subset of the genome has reads.
#force_coverage: 50

## regions to exclude when identifying null regions that should not have REs; below are the defaults automatically used for hg38.
excludes:
 - workflow/annotations/hg38.fa.sorted.bed
#- workflow/annotations/hg38.gap.bed.gz
#- workflow/annotations/SDs.merged.hg38.bed.gz

## you can optionally specify a model that is not the default.
# model: models/my-custom-model.dat

##
## only used if training a new model
##
# train: True
# dhs: workflow/annotations/GM12878_DHS.bed.gz # regions of suspected regulatory elements
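One thing we are unsure about in the config above (our assumption, not anything from the FIRE docs): keep_chromosomes appears three times, and if the workflow reads the config with a standard YAML parser such as PyYAML, duplicate mapping keys silently collapse to the last value, so only chr20 would survive. A single regex such as "chr4|chr7|chr20" may be what is intended:

import yaml  # PyYAML, used here only to illustrate duplicate-key behavior

text = """
keep_chromosomes: chr4
keep_chromosomes: chr7
keep_chromosomes: chr20
"""

# PyYAML keeps only the last occurrence of a duplicate key, silently.
print(yaml.safe_load(text))  # {'keep_chromosomes': 'chr20'}

# Possible single-line alternative (assumption about the intended behavior):
# keep_chromosomes: "chr4|chr7|chr20"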

Example of error log:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, time=100440, gpus=0
Select jobs to execute...
[Thu Dec 28 16:04:53 2023]
rule fdr_table:
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    jobid: 0
    reason: Forced execution
    wildcards: sm=bc2031
    threads: 8
    resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, tmpdir=/tmp, time=100440, gpus=0
        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl

Activating conda environment: ../../../../../../../home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
[INFO][Time elapsed (ms) 1068]: Reading FIRE file: results/bc2031/fiber-calls/FIRE.bed.gz
/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py:486: DeprecationWarning: `the argument comment_char` for `read_csv` is deprecated. It has been renamed to `comment_prefix`.
  fire = pl.read_csv(
[INFO][Time elapsed (ms) 1082]: Reading genome file: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
[INFO][Time elapsed (ms) 1085]: Reading fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz
[INFO][Time elapsed (ms) 1095]: Reading shuffled fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz
Traceback (most recent call last):
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 539, in <module>
    defopt.run(main, show_types=True, version="0.0.1")
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/defopt.py", line 356, in run
    return call()
           ^^^^^^
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 517, in main
    shuffled_locations = pl.read_csv(
                         ^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/io/csv/functions.py", line 369, in read_csv
    df = pl.DataFrame._read_csv(
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/dataframe/frame.py", line 784, in _read_csv
    self._df = PyDataFrame.read_csv(
               ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.NoDataError: empty CSV
[Thu Dec 28 16:04:54 2023]
Error in rule fdr_table:
    jobid: 0
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    conda-env: /home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
    shell:

        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Index(['bc2029', 'bc2031', 'bc2025', 'bc2027', 'bc2026', 'bc2032', 'bc2030',
       'bc2028'],
      dtype='object', name='sample')

Exclusion BED file (a sketch of how such a complement BED can be generated follows the listing):

chr1    1   248956422
chr10   1   133797422
chr11   1   135086622
chr12   1   133275309
chr13   1   114364328
chr14   1   107043718
chr15   1   101991189
chr16   1   90338345
chr17   1   83257441
chr18   1   80373285
chr19   1   58617616
chr2    1   242193529
chr20   1   4680670
chr20   4690391 64444167
chr21   1   46709983
chr22   1   50818468
chr3    1   198295559
chr4    1   3072454
chr4    3077294 190214555
chr5    1   181538259
chr6    1   170805979
chr7    1   140917955
chr7    140927420   159345973
chr8    1   145138636
chr9    1   138394717
chrM    1   16569
chrX    1   156040895
chrY    1   57227415
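For reference, a minimal sketch of how a complement exclusion BED like the one above can be built from the targeted regions and the reference .fai; the file names are placeholders rather than the exact ones we used, and bedtools complement would produce the same intervals (with 0-based starts, i.e. 0 rather than 1 on fully excluded chromosomes):

from collections import defaultdict

fai_path = "hg38.fa.fai"             # assumed: reference index with chromosome lengths
targets_path = "targets.bed"         # assumed: targeted regions on chr4, chr7, and chr20
out_path = "exclude.complement.bed"  # assumed output name

# Chromosome sizes from the .fai (first two columns: name, length).
sizes = {}
with open(fai_path) as fh:
    for line in fh:
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)

# Targeted intervals grouped by chromosome.
targets = defaultdict(list)
with open(targets_path) as fh:
    for line in fh:
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        targets[chrom].append((int(start), int(end)))

# Complement: everything outside the targeted intervals, per chromosome.
with open(out_path, "w") as out:
    for chrom, length in sizes.items():
        prev = 0
        for start, end in sorted(targets[chrom]):
            if start > prev:
                out.write(f"{chrom}\t{prev}\t{start}\n")
            prev = max(prev, end)
        if prev < length:
            out.write(f"{chrom}\t{prev}\t{length}\n")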
mrvollger commented 11 months ago

This is an issue for the FIRE repo and not fibertools. Can you please repost it there?