LooseLab / readfish

CLI tool for flexible and fast adaptive sampling on ONT sequencers
https://looselab.github.io/readfish/
GNU General Public License v3.0

Adding additional .mmi extensions to all possible fasta file format l… #330

Closed mattloose closed 5 months ago

mattloose commented 5 months ago

…abels to catch acceptable file types.

This request addresses #326 and enables validation of the following file type endings.

['.fasta.gz', '.fna.gz', '.fsa.gz', '.fa.gz', '.fastq.gz', '.fq.gz', '.fasta.mmi', '.fna.mmi', '.fsa.mmi', '.fa.mmi', '.fastq.mmi', '.fq.mmi', '.fasta', '.fna', '.fsa', '.fa', '.fastq', '.fq', '.mmi']

I think a better way to handle this in the future would be to explicitly determine whether the file is readable by mappy. However, this will catch the most common mmi extension types.
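A minimal sketch of what such a "readable by mappy" check could look like. It relies on the documented mappy pattern of constructing `mappy.Aligner(path)` and testing its truthiness. The helper name `index_is_loadable` and the injectable `loader` parameter are hypothetical and are not part of readfish. Note that for a large FASTA this would build an index, which may be slow.

```python
from typing import Callable, Optional


def index_is_loadable(
    path: str, loader: Optional[Callable[[str], object]] = None
) -> bool:
    """Return True if `loader(path)` yields a usable (truthy) index object.

    `loader` defaults to mappy.Aligner; it is injectable (an assumption made
    for this sketch) so the check can be exercised without mappy installed.
    """
    if loader is None:
        # Imported lazily so importing this module does not require mappy.
        import mappy

        loader = mappy.Aligner
    try:
        # A mappy.Aligner is falsy when the index failed to load or build.
        return bool(loader(path))
    except Exception:
        # Treat any loader error as "not readable" in this sketch.
        return False
```

With mappy installed, `index_is_loadable("reference.fna.mmi")` would attempt a real load, so the result would no longer depend on how the file is named.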

alexomics commented 5 months ago

I don't think the current fix addresses the problem in the issue, which is an extra . in the filename (GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.mmi). Because of it, pathlib suggests that the suffix is .15_grch38_no_alt_analysis_set.fna.mmi

We could use something like:

file_extensions = ['.fasta', '.fna', '.fsa', '.fa', '.fastq', 
                   '.fq', '.fasta.gz', '.fna.gz', '.fsa.gz', 
                   '.fa.gz', '.fastq.gz', '.fq.gz', '.mmi']
if not any(index.lower().endswith(suffix) for suffix in file_extensions):
    raise ...

Probably should add a test case TOML for this too.
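To make the difference concrete, here is a small self-contained sketch that compares joining `pathlib` suffixes with the `endswith` check, using the filename from the issue. The helper names `joined_suffixes` and `has_valid_extension` are hypothetical, and the assumption that the original check joined and lowercased `Path.suffixes` is inferred from the suffix quoted above rather than from the readfish source.

```python
from pathlib import Path

# Extensions taken from the list proposed in this thread.
FILE_EXTENSIONS = [
    ".fasta", ".fna", ".fsa", ".fa", ".fastq", ".fq",
    ".fasta.gz", ".fna.gz", ".fsa.gz", ".fa.gz", ".fastq.gz", ".fq.gz",
    ".mmi",
]


def joined_suffixes(name: str) -> str:
    """Join every pathlib suffix; an extra '.' in the stem leaks into this."""
    return "".join(Path(name).suffixes).lower()


def has_valid_extension(name: str) -> bool:
    """Check only how the filename ends, ignoring dots earlier in the name."""
    return any(name.lower().endswith(suffix) for suffix in FILE_EXTENSIONS)


name = "GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.mmi"
print(joined_suffixes(name))                      # .15_grch38_no_alt_analysis_set.fna.mmi
print(joined_suffixes(name) in FILE_EXTENSIONS)   # False
print(has_valid_extension(name))                  # True
```

Since `str.endswith` also accepts a tuple, `name.lower().endswith(tuple(FILE_EXTENSIONS))` would be an equivalent one-liner.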

mattloose commented 5 months ago

I mean - it does fix it...

But I'll look at this alternative.

mattloose commented 5 months ago

OK It doesn't fix it.

Whoops.

mattloose commented 5 months ago

This addresses the issue raised by @alexomics (thanks - good spot).

I've attempted to add in a test toml but I don't think this is yet correct - advice from anyone appreciated.

alexomics commented 5 months ago

So the TOML tests are a bit of a Rube Goldberg machine, you need to:

  1. create a TOML file (readfish/tests/static/toml_validation_test/fail/005_bad_reference_file.TOML)
  2. create a matching TXT file (readfish/tests/static/toml_validation_test/fail/005_bad_reference_file.txt) that contains the expected error message(s) the TOML will generate

It's described a little bit more on the README for each test folder.
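Because each test needs both files, a quick pairing check can catch a missing half before the test suite does. The following is a hypothetical helper, not part of the readfish test harness; it only assumes the convention described above, where a TOML file and a TXT file share the same stem in one directory.

```python
from pathlib import Path
from typing import List


def unpaired_test_files(test_dir: Path) -> List[str]:
    """Return stems that have a TOML without a TXT, or a TXT without a TOML."""
    # Suffix comparison is case-insensitive, so .TOML and .toml both count.
    tomls = {p.stem for p in test_dir.iterdir() if p.suffix.lower() == ".toml"}
    txts = {p.stem for p in test_dir.iterdir() if p.suffix.lower() == ".txt"}
    # Symmetric difference: stems present in exactly one of the two sets.
    return sorted(tomls ^ txts)
```

Running it over a `fail` or `pass` directory and getting an empty list back would suggest that every test case has both of its files.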

mattloose commented 5 months ago

re: the tests - the output of the readfish validate command won't error on this file now - it will just give a warning - can that be caught in a text file? I note that in the other examples for validating mappy the text file is just empty?

mattloose commented 5 months ago

OK - this adds a fail and pass test for these specific issues - the pass reference index file is named as per the issue that started this.

alexomics commented 5 months ago

Almost there! You seem to be missing a TOML file tests/static/mappy_validation_test/fail/004_bad_reference_file_extension.toml

mattloose commented 5 months ago

D'oh

Should be there now. I forgot I needed to force add the toml due to the gitignore!