ValueError Troubleshoot

vestalgd commented 3 years ago

I keep getting a ValueError. I've tried both the default values and correctly formatted values, and the error message does not change.

snakemake correct_bias
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       correct_bias
    1       epiweek_conversion
    1       genome_matrix
    3

[Sun Sep 12 14:47:08 2021] Job 2: Generate matrix of genome counts per day, for each element in column="code"

Loading genome metadata
Converting code into codes (acronyms)
Removing genomes with incomplete dates
Filtering genomes by start and end dates

Available genomes

Traceback (most recent call last):
  File "scripts/get_genome_matrix.py", line 171, in <module>
    print('\tOldest collected sampled = ' + df[date_col].min().strftime('%Y-%m-%d'))
  File "pandas/_libs/tslibs/nattype.pyx", line 60, in pandas._libs.tslibs.nattype._make_error_func.f
ValueError: NaTType does not support strftime
[Sun Sep 12 14:47:09 2021]
Error in rule genome_matrix:
    jobid: 2
    output: outputs/genome_matrix_days.tsv
    shell:

    python3 scripts/get_genome_matrix.py            --metadata data/metadata_nextstrain.tsv             --index-column code             --extra-columns country             --date-column date          --output outputs/genome_matrix_days.tsv

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Is there a recommendation to remedy this?

yanglingyue commented 3 years ago

Hello, I have encountered the same problem. Do you have a solution? If so, please let me know. Thank you!

vestalgd commented 3 years ago

@yanglingyue

I was advised to use a column name from the metadata header (e.g. country or division) as the index_column. I set both index_column and extra_columns to "country" for a global subsample, and both to "division" for a USA-focused subsample.

rule arguments:
    params:
        sequences = "data/gisaid_hcov-19.fasta",
        metadata = "data/metadata_nextstrain.tsv",
        case_data = "data/time_series_covid19_usa_reformatted.tsv",
        keep_file = "config/keep.txt",
        remove_file = "config/remove.txt",
        include_file = "config/strict_inclusion.tsv",
        drop_file = "config/batch_removal.tsv",
        index_column = "country",
        date_column = "date",
        baseline = "0.0001",
        refgenome_size = "29930",
        max_missing = "30",
        seed_num = "2007",
        start_date = "2020-01-01",
        end_date = "2021-01-01"
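A quick pre-flight check can show whether settings like these leave any metadata rows to work with. The sketch below is not part of subsampler; the function name `count_usable_rows` and the sample rows are made up for illustration, and it assumes a tab-separated metadata file with dates written as YYYY-MM-DD. If the count is zero, a call such as `df[date_col].min()` on the filtered table would return NaT, which is consistent with the error at the top of this thread.

```python
import csv
import io
from datetime import date


def count_usable_rows(handle, index_column, date_column, start_date, end_date):
    """Count rows with a non-empty index column and a complete date in range.

    Hypothetical helper for illustration; not a subsampler function.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in (index_column, date_column) if c not in header]
    if missing:
        raise KeyError(f"columns not found in metadata header: {missing}")

    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    usable = 0
    for row in reader:
        try:
            collected = date.fromisoformat(row[date_column])
        except (TypeError, ValueError):
            continue  # skip incomplete dates such as "2020-03" or "2020-03-XX"
        if row[index_column] and start <= collected <= end:
            usable += 1
    return usable


# Made-up sample metadata: one usable row, one incomplete date, one out of range.
sample = (
    "strain\tcountry\tdate\n"
    "A/1\tBrazil\t2020-03-15\n"
    "A/2\tBrazil\t2020-03-XX\n"
    "A/3\tUSA\t2021-06-01\n"
)
print(count_usable_rows(io.StringIO(sample), "country", "date",
                        "2020-01-01", "2021-01-01"))
```

Passing an index column that is absent from the header (for example "code" with this sample) raises a KeyError instead of silently producing an empty table.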

When I ran the subsample step, I used index_column = "code" with extra_columns = None. Also make sure your FASTA headers match the names in the metadata. I removed the hCoV-19/ prefix from both the sequences and the metadata, and used augur parse to strip the EPI_ISL accessions and dates that GISAID attaches to the headers.
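To check whether FASTA headers and metadata names agree, something like the following sketch may help. It is an illustration only, not subsampler or augur code: the helper names are hypothetical, the metadata name column is assumed to be called "strain", and the "|" delimiter in GISAID-style headers is an assumption about the file layout.

```python
import csv
import io


def normalize(name, prefix="hCoV-19/"):
    """Drop the prefix and anything after the first '|' (assumed delimiter)."""
    name = name.strip().split("|", 1)[0]
    return name[len(prefix):] if name.startswith(prefix) else name


def compare_names(fasta_handle, metadata_handle, name_column="strain"):
    """Return (names only in the FASTA, names only in the metadata)."""
    fasta_names = {
        normalize(line[1:]) for line in fasta_handle if line.startswith(">")
    }
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    meta_names = {
        normalize(row[name_column]) for row in reader if row.get(name_column)
    }
    return fasta_names - meta_names, meta_names - fasta_names


# Made-up records for illustration; the accession below is not a real one.
fasta = (
    ">hCoV-19/Brazil/SP-01/2020|EPI_ISL_000001|2020-03-15\nACGT\n"
    ">USA/CA-02/2020\nACGT\n"
)
metadata = (
    "strain\tcountry\tdate\n"
    "Brazil/SP-01/2020\tBrazil\t2020-03-15\n"
    "USA/NY-03/2020\tUSA\t2020-04-01\n"
)
only_fasta, only_meta = compare_names(io.StringIO(fasta), io.StringIO(metadata))
print("only in FASTA:", sorted(only_fasta))
print("only in metadata:", sorted(only_meta))
```

Two empty lists mean every sequence has a metadata row and vice versa; anything listed is a name that needs fixing on one side.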

yanglingyue commented 2 years ago

Hi~

Thank you very much for your reply. I ran subsampler on the example data and, following your suggestion, set both index_column and extra_columns to "country". When I ran snakemake correct_bias again, no error was reported, but snakemake subsample then failed with a new error. How can I solve it? The error log is shown below:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1       subsample
    1

[Sat Nov 27 08:02:01 2021] Job 0: Sample genomes and metadata according to the corrected genome matrix

[Sat Nov 27 08:02:02 2021]
Error in rule subsample:
    jobid: 0
    output: outputs/sequences.fasta, outputs/metadata.tsv, outputs/sampling_stats.txt
    shell:

    python3 scripts/subsampler_timeseries.py --sequences /home/tianhuaiyu/subsampler/data/sequences.fasta --metadata /home/tianhuaiyu/subsampler/data/metadata.tsv --genome-matrix outputs/matrix_genomes_epiweeks_corrected.tsv --max-missing 30 --refgenome-size 29930 --keep /home/tianhuaiyu/subsampler/config/keep.txt --remove /home/tianhuaiyu/subsampler/config/remove.txt --drop_list /home/tianhuaiyu/subsampler/config/batch_removal.tsv --include_list /home/tianhuaiyu/subsampler/config/strict_inclusion.tsv --seed 2007 --index-column country --date-column date --filter-column date --start-date 2020-01-01 --end-date 2021-01-01 --sampled-sequences outputs/sequences.fasta --sampled-metadata outputs/metadata.tsv --report outputs/sampling_stats.txt
    echo '# Sampling proportion: 0.0001' | cat - outputs/sampling_stats.txt > temp && mv temp outputs/sampling_stats.txt
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/tianhuaiyu/subsampler/.snakemake/log/2021-11-27T080201.504089.snakemake.log

andersonbrito / subsampler

ValueError Troubleshoot #1