andersonbrito / subsampler


ValueError Troubleshoot #1

Closed vestalgd closed 2 years ago

vestalgd commented 3 years ago

I keep receiving a ValueError code. I've tried using the default values and the correct formatted values and there is no change in the error message.

    snakemake correct_bias
    Building DAG of jobs...
    Using shell: /bin/bash
    Provided cores: 4
    Rules claiming more threads will be scaled down.
    Job counts:
        count   jobs
        1       correct_bias
        1       epiweek_conversion
        1       genome_matrix
        3

[Sun Sep 12 14:47:08 2021] Job 2: Generate matrix of genome counts per day, for each element in column="code"

    Traceback (most recent call last):
      File "scripts/get_genome_matrix.py", line 171, in <module>
        print('\tOldest collected sampled = ' + df[date_col].min().strftime('%Y-%m-%d'))
      File "pandas/_libs/tslibs/nattype.pyx", line 60, in pandas._libs.tslibs.nattype._make_error_func.f
    ValueError: NaTType does not support strftime
    [Sun Sep 12 14:47:09 2021]
    Error in rule genome_matrix:
        jobid: 2
        output: outputs/genome_matrix_days.tsv
        shell:

    python3 scripts/get_genome_matrix.py            --metadata data/metadata_nextstrain.tsv             --index-column code             --extra-columns country             --date-column date          --output outputs/genome_matrix_days.tsv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

Is there a recommendation to remedy this?
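A note for later readers: in pandas, the minimum of a datetime column that is empty, or that holds only missing dates, is NaT, and calling strftime on NaT raises the ValueError shown above. The traceback therefore suggests that no row with a usable date was left when line 171 ran. The sketch below is not subsampler's code. It is a standalone check, using only the Python standard library, that counts how many metadata rows carry a complete YYYY-MM-DD date inside a date window. The function name `count_usable_rows`, the toy rows, and the window values are assumptions made up for illustration.

```python
import csv
import io
from datetime import datetime

def count_usable_rows(handle, date_col="date", start="2020-01-01", end="2021-01-01"):
    """Count TSV rows whose date is a complete YYYY-MM-DD inside [start, end]."""
    fmt = "%Y-%m-%d"
    lo = datetime.strptime(start, fmt)
    hi = datetime.strptime(end, fmt)
    usable = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        value = (row.get(date_col) or "").strip()
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # incomplete or malformed date such as "2020-03" or "2020-XX-XX"
        if lo <= parsed <= hi:
            usable += 1
    return usable

# Toy metadata, invented for this example (not taken from the thread).
toy = (
    "strain\tcountry\tdate\n"
    "A\tUSA\t2020-03-14\n"    # complete date, inside the window
    "B\tUSA\t2020-03\n"       # incomplete date, skipped
    "C\tBrazil\t2021-06-01\n" # complete date, outside the window
)
usable = count_usable_rows(io.StringIO(toy))
print(usable)
```

If a check like this returns 0 for the real metadata file, the date column name or the start and end dates are worth checking before running the genome_matrix rule again.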

yanglingyue commented 3 years ago

I keep receiving a ValueError code. I've tried using the default values and the correct formatted values and there is no change in the error message.

    snakemake correct_bias
    Building DAG of jobs...
    Using shell: /bin/bash
    Provided cores: 4
    Rules claiming more threads will be scaled down.
    Job counts:
        count   jobs
        1       correct_bias
        1       epiweek_conversion
        1       genome_matrix
        3

[Sun Sep 12 14:47:08 2021] Job 2: Generate matrix of genome counts per day, for each element in column="code"

  • Loading genome metadata
  • Converting code into codes (acronyms)
  • Removing genomes with incomplete dates
  • Filtering genomes by start and end dates
  • Available genomes

    Traceback (most recent call last):
      File "scripts/get_genome_matrix.py", line 171, in <module>
        print('\tOldest collected sampled = ' + df[date_col].min().strftime('%Y-%m-%d'))
      File "pandas/_libs/tslibs/nattype.pyx", line 60, in pandas._libs.tslibs.nattype._make_error_func.f
    ValueError: NaTType does not support strftime
    [Sun Sep 12 14:47:09 2021]
    Error in rule genome_matrix:
        jobid: 2
        output: outputs/genome_matrix_days.tsv
        shell:

  python3 scripts/get_genome_matrix.py            --metadata data/metadata_nextstrain.tsv             --index-column code             --extra-columns country             --date-column date          --output outputs/genome_matrix_days.tsv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

Is there a recommendation to remedy this?

Hello, I have run into the same problem. Do you have a solution? If so, please let me know. Thank you!

vestalgd commented 3 years ago

@yanglingyue

I was advised to use a column name from the metadata header (e.g. country or division) as the index_column. I set both index_column and extra_columns to "country" for a global subsample, and both to "division" for a USA-focused subsample.

    rule arguments:
        params:
            sequences = "data/gisaid_hcov-19.fasta",
            metadata = "data/metadata_nextstrain.tsv",
            case_data = "data/time_series_covid19_usa_reformatted.tsv",
            keep_file = "config/keep.txt",
            remove_file = "config/remove.txt",
            include_file = "config/strict_inclusion.tsv",
            drop_file = "config/batch_removal.tsv",
            index_column = "country",
            date_column = "date",
            baseline = "0.0001",
            refgenome_size = "29930",
            max_missing = "30",
            seed_num = "2007",
            start_date = "2020-01-01",
            end_date = "2021-01-01"
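Since the advice is that index_column must name a column in the metadata header, a quick header check can rule that out before running snakemake. This is a minimal sketch, not part of subsampler; `missing_columns` and the toy header are hypothetical names and data made up for illustration. Whether a missing column was the actual cause of the error in this thread is not established.

```python
import csv
import io

def missing_columns(handle, required):
    """Return the required column names that are absent from a TSV header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]

# Toy header, invented for this example.
toy_header = "strain\tcountry\tdivision\tdate\n"

present = missing_columns(io.StringIO(toy_header), ["country", "date"])
absent = missing_columns(io.StringIO(toy_header), ["code", "date"])
print(present)  # columns missing when index_column is "country"
print(absent)   # columns missing when index_column is "code"
```

For a real file, pass `open("data/metadata_nextstrain.tsv", newline="")` in place of the `io.StringIO` object, with the index_column and date_column values from the params above.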

When I ran the subsample step, I used index_column = "code" but set extra_columns = None. Make sure your FASTA headers match the metadata too. I removed the hCoV-19/ prefix from both the sequences and the metadata, and used augur parse to strip the EPI_ISL accession and the date that GISAID attaches to the headers.
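To check whether FASTA headers and metadata names agree, something along these lines may help. It is a rough sketch under stated assumptions, not what subsampler or augur does internally: it assumes GISAID-style headers of the form `>hCoV-19/name|EPI_ISL_...|date` with `|` as the separator, and the helper names (`normalize`, `fasta_names`, `unmatched`) and toy records, including the accession numbers, are invented for illustration.

```python
import io

PREFIX = "hCoV-19/"  # prefix mentioned above; removed from both sides before comparing

def normalize(name):
    """Strip whitespace and the hCoV-19/ prefix from a sequence name."""
    name = name.strip()
    return name[len(PREFIX):] if name.startswith(PREFIX) else name

def fasta_names(handle):
    """Collect normalized names from FASTA header lines, dropping '|'-separated extras."""
    return {
        normalize(line[1:].split("|")[0])
        for line in handle
        if line.startswith(">")
    }

def unmatched(fasta_handle, metadata_names):
    """Return FASTA names that have no counterpart in the metadata, sorted."""
    known = {normalize(n) for n in metadata_names}
    return sorted(fasta_names(fasta_handle) - known)

# Toy records, invented for this example.
toy_fasta = (
    ">hCoV-19/USA/CA-1/2020|EPI_ISL_000001|2020-03-14\nACGT\n"
    ">hCoV-19/Brazil/SP-2/2020|EPI_ISL_000002|2020-04-01\nACGT\n"
)
toy_metadata_names = ["USA/CA-1/2020"]

orphans = unmatched(io.StringIO(toy_fasta), toy_metadata_names)
print(orphans)
```

Any name reported here exists in the sequence file but not in the metadata, so it cannot be matched to a row when sampling.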

yanglingyue commented 2 years ago

Hi~

Thank you very much for your reply. I ran subsampler on the example data and, following your suggestion, set both index_column and extra_columns to "country". When I ran snakemake correct_bias again, no error was reported, but snakemake subsample then failed with a new error. How can I solve it? The error log is shown below:

    Building DAG of jobs...
    Using shell: /usr/bin/bash
    Provided cores: 16
    Rules claiming more threads will be scaled down.
    Job counts:
        count   jobs
        1       subsample
        1

[Sat Nov 27 08:02:01 2021] Job 0: Sample genomes and metadata according to the corrected genome matrix

    [Sat Nov 27 08:02:02 2021]
    Error in rule subsample:
        jobid: 0
        output: outputs/sequences.fasta, outputs/metadata.tsv, outputs/sampling_stats.txt
        shell:

    python3 scripts/subsampler_timeseries.py --sequences /home/tianhuaiyu/subsampler/data/sequences.fasta --metadata /home/tianhuaiyu/subsampler/data/metadata.tsv --genome-matrix outputs/matrix_genomes_epiweeks_corrected.tsv --max-missing 30 --refgenome-size 29930 --keep /home/tianhuaiyu/subsampler/config/keep.txt --remove /home/tianhuaiyu/subsampler/config/remove.txt --drop_list /home/tianhuaiyu/subsampler/config/batch_removal.tsv --include_list /home/tianhuaiyu/subsampler/config/strict_inclusion.tsv --seed 2007 --index-column country --date-column date --filter-column date --start-date 2020-01-01 --end-date 2021-01-01 --sampled-sequences outputs/sequences.fasta --sampled-metadata outputs/metadata.tsv --report outputs/sampling_stats.txt
    echo '# Sampling proportion: 0.0001' | cat - outputs/sampling_stats.txt > temp && mv temp outputs/sampling_stats.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/tianhuaiyu/subsampler/.snakemake/log/2021-11-27T080201.504089.snakemake.log

