MrTomRod / scoary-2

Calculate associations between genes and traits
MIT License

"contains NaN" error when running Scoary2 on gene_presence_absence.csv from Roary #9

Closed: ndusek closed this issue 7 months ago

ndusek commented 7 months ago

We are trying to process a gene_presence_absence.csv file from Roary with Scoary2. Previously, we were using Scoary (v1) and were able to get results (albeit with a few errors in the log file), whereas with Scoary2, the exact same command is failing.

Here are the versions we are using for each of these packages:

Scoary: 1.6.16
Scoary2: 0.0.15
Roary: 3.13.0

Scoary (v1) results

Here is the command we have been using with Scoary (v1):

scoary --genes roary_output/85/gene_presence_absence.csv \
       --traits traits.csv \
       --outdir scoary1_test

The process completes successfully, although it does print the following error several times:

ERROR: Some isolates in your gene presence absence file were not represented in your traits file. These will count as MISSING data and will not be included.

But this does not prevent us from getting results for isolates that were not missing, so I consider this to be acceptable.

Scoary2 results

The Scoary2 usage guide suggests that we should be able to use the exact same command with the same inputs for Scoary2, so here is what we are running:

scoary2 --genes roary_output/85/gene_presence_absence.csv \
        --traits traits.csv \
        --outdir scoary2_test

This is failing with the following trace:

Loading traits...
Loading genes...
/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/scoary/load_genes.py:45: DtypeWarning: Columns (15,21,22,24,26,28,30,36,39,43,60,66,72,73,74,77,83,86,90,92,94,101,108,112,119,124,125,128,135,149,150,152,154,155,160,172,173,176,177,178,179,180,183) have mixed types. Specify dtype option on import or set low_memory=False.
  count_df = pd.read_csv(path, delimiter=delimiter, index_col=0)
Welcome to Scoary2! (0.0.15)
Traceback (most recent call last):
  File "/home/ndusek/miniconda3/envs/scoary2/bin/scoary2", line 8, in <module>
    sys.exit(main())
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/scoary/scoary.py", line 380, in main
    fire.Fire(scoary)
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/scoary/scoary.py", line 132, in scoary
    genes_orig_df, genes_bool_df = load_genes(
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/scoary/load_genes.py", line 146, in load_genes
    genes_orig_df, genes_bool_df = load_gene_count_file(genes, delimiter, restrict_to, ignore)
  File "/home/ndusek/miniconda3/envs/scoary2/lib/python3.10/site-packages/scoary/load_genes.py", line 54, in load_gene_count_file
    assert not count_df.isna().values.any(), f'{path=}: contains NaN'
AssertionError: path='roary_output/85/gene_presence_absence.csv': contains NaN

The error message "contains NaN" is clear enough, but I don't understand why Scoary2 would suddenly complain about this when the original Scoary had no problem with the same file.

Any idea what's going on here?

MrTomRod commented 7 months ago

I suspect you have to add the --gene-data-type argument, i.e., something like --gene-data-type 'gene-count:,' or --gene-data-type 'gene-list:\t', depending on how your gene_presence_absence.csv is formatted.

If that doesn't solve your problem, you could mail me your dataset and I could have a look.
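To check which of the two formats a file is in before choosing --gene-data-type, a rough sketch like the following may help. It is not part of Scoary2: guess_gene_data_type and its n_annotation_cols parameter are hypothetical names made up for illustration, and the toy tables below only assume what the two formats look like (integer counts in every cell versus gene identifiers, with empty cells where a gene is absent). Those empty cells are presumably what a count parser ends up reading as NaN.

```python
import csv
import io


def guess_gene_data_type(text, delimiter=",", n_annotation_cols=0):
    """Hypothetical helper (not part of Scoary2).

    Guesses whether a gene table looks like 'gene-count' (integer cells)
    or 'gene-list' (gene identifiers, empty cell = gene absent) and counts
    the empty cells. n_annotation_cols is the number of non-isolate columns
    that follow the first (gene name) column; it is an assumption that the
    caller knows this number for their file.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    first_data_col = 1 + n_annotation_cols
    # Skip the header row and look only at the per-isolate cells.
    cells = [cell.strip() for row in rows[1:] for cell in row[first_data_col:]]
    n_empty = sum(1 for cell in cells if cell == "")
    non_empty = [cell for cell in cells if cell != ""]
    if non_empty and all(cell.isdigit() for cell in non_empty):
        kind = "gene-count"
    else:
        kind = "gene-list"
    return f"{kind}:{delimiter}", n_empty


# Toy tables (made up for illustration, not real Roary output).
gene_list_table = "Gene,iso1,iso2\ngroupA,iso1_001,\ngroupB,iso1_002,iso2_001\n"
gene_count_table = "Gene,iso1,iso2\ngroupA,1,0\ngroupB,2,1\n"

print(guess_gene_data_type(gene_list_table))   # ('gene-list:,', 1)
print(guess_gene_data_type(gene_count_table))  # ('gene-count:,', 0)
```

For a real file, read its contents into a string first and set n_annotation_cols to however many annotation columns sit between the gene name and the first isolate column. The returned string has the same shape as the values passed to --gene-data-type above.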

ndusek commented 7 months ago

@MrTomRod thank you for the quick reply!

Adding --gene-data-type 'gene-list:,' did indeed resolve the issue.

You might consider updating the usage guide for running Scoary2 on Roary output to include that flag, since I think that is the default output format for Roary. Just a suggestion...

MrTomRod commented 7 months ago

I changed it to this:

# Dataset from Scoary 1: genes in Roary gene count format
scoary2 \
    --genes Gene_presence_absence.csv \
    --gene-data-type 'gene-count:,' \
    --traits Tetracycline_resistance.csv \
    --outdir out \
    --n-permut 1000
# If gene_presence_absence.csv is in gene-list format, use 
#   --gene-data-type 'gene-list:,'
# instead

Do you think that's clear enough?

ndusek commented 7 months ago

Yep, looks great to me!