bcgsc / btllib

Bioinformatics Technology Lab common code library
Other
23 stars 5 forks source link

Indexlr: spaces are not supported in file path #129

Open mxwang66 opened 2 weeks ago

mxwang66 commented 2 weeks ago

Hello devs,

I ran into this error running Indexlr in python on a linux x64 system. Below is the code to replicate this error:

from btllib import Indexlr, IndexlrFlag

assembly_path = 'genomes/Salmonella bongori/ncbi_dataset/data/GCF_000252995.1/genomic.fna'
kmerlen = 21
windowsize = 200
with Indexlr(assembly_path, kmerlen, windowsize, IndexlrFlag.LONG_MODE, 1) as assembly:
    for record in assembly:
        for m in record.minimizers:
            pass

and the error message:

cat: genomes/Salmonella: No such file or directory
cat: bongori/ncbi_dataset/data/GCF_000252995.1/genomic.fna: No such file or directory
[2024-11-05 01:39:10][WARNING] genomes/Salmonella bongori/ncbi_dataset/data/GCF_000252995.1/genomic.fna is empty.
[2024-11-05 01:39:10][ERROR] A helper process has finished unsuccessfully:
PID: 2313645
Outcome: exited with status 1
[2024-11-05 01:39:10][ERROR] Process pipeline: Communication failure.
[2024-11-05 01:39:10][ERROR] Process pipeline: Spawner process failed.
Segmentation fault (core dumped)

Got the same error when using the indexlr command:

indexlr -k 21 -w 200 --long -o test.out "genomes/Salmonella bongori/ncbi_dataset/data/GCF_000252995.1/genomic.fna"

Both run was successful if the space in the file path is replaced.

Thank you, Michael

lcoombe commented 2 weeks ago

Hi Michael,

Thanks for reaching out!

I'm not really surprised that you would see unexpected behaviour when trying to use files with spaces in them - generally, the convention is to avoid using spaces in file/directory names (I found some good documentation of conventions here: https://datamanagement.hms.harvard.edu/plan-design/file-naming-conventions). There would be a number of issues possible there, including that indexlr can take multiple files, and the spaces would make it complicated to distinguish different files vs. a single file with the space(s).

Because of all that, I'd highly recommend removing spaces in your file names. If that isn't a good option (I understand sometimes changing file names after the fact can be messy), then you could make a soft-link to that problem file, and specify that to indexlr instead. Ex:

(btl) [lcoombe@hpce706 tmp]$ ls my\ test/
tmp.fa
(btl) [lcoombe@hpce706 tmp]$ ls -lah tmp1.fa 
lrwxrwxrwx 1 lcoombe btl 14 Nov  5 08:52 tmp1.fa -> my test/tmp.fa
(btl) [lcoombe@hpce706 tmp]$ indexlr -k24 -w5 tmp1.fa 
NC_012920.1_1-50    16371557189216790589 3754088885574348785 9549692006826280120 11376364575436671136 17106421449899264663 2564125396831936452 6856120600046910211 636075098808042691

Thank you for your interest in btllib! Lauren

mxwang66 commented 2 weeks ago

Thanks for your response Lauren! Yeah the best practice would be having no spaces. But it would nice to make it supported, since python itself supports file path with spaces. But I understand that you might need to spend time on other features. Thanks again for maintaining btllib!

Michael

vlad0x00 commented 2 weeks ago

@lcoombe I think the issue can be resolved by surrounding path with quotes in https://github.com/bcgsc/btllib/blob/master/src/btllib/data_stream.cpp#L242-L299

At the start of the function, you can do const auto quoted_path = '"' + path + '"'; and then replace all usages of path in that function with quoted_path. Probably also worth checking that the path is not already quoted by checking if the first and last characters are quotes or apostrophes.

lcoombe commented 2 weeks ago

Thanks @vlad0x00!

That's really helpful - we can take a look at making that tweak (@parham-k or @jwcodee?) (We're a bit short-staffed, and I will be going on leave soon, so might not happen quickly :) )

parham-k commented 2 weeks ago

I can take care of it.

lcoombe commented 2 weeks ago

I can take care of it.

Thanks so much Parham!