bioinfo-ut / PhenotypeSeeker

Identify phenotype-specific k-mers and predict phenotype using sequenced bacterial strains
GNU General Public License v3.0

Failure to make wordmap during modeling - gt4_wordmap_new: could not mmap file #9

Closed juanitagutierrez closed 5 years ago

juanitagutierrez commented 5 years ago

Hi there,

I am opening this issue even though a very similar one was solved before, because the problem persists for me. I am sorry to bring it back! I am just starting to run the modeling pipeline for PhenotypeSeeker, but I always get the following error messages:

Generating the k-mer lists for input samples: 10 of 10 lists generated.
Generating the k-mer feature vector.
Mapping samples to the feature vector space:
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-Ltenue-6-105-1_S1_L001_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-Ltenue-6-105-1_S1_L001_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-Ltenue-8-34-3_S21_L007_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-Ltenue-8-34-3_S21_L007_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-Ltenue-8-16-1_S22_L008_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-Ltenue-8-16-1_S22_L008_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-8-50-1-thrum_S18_L006_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-8-50-1-thrum_S18_L006_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-Ltenue-6-35-2_S24_L008_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-Ltenue-6-35-2_S24_L008_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-8-8-2-Pin_S17_L006_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-8-8-2-Pin_S17_L006_R1_001_13.list!
gt4_wordmap_new: could not mmap file K-mer_lists/RD-1806-6-75-2-pin_S19_L007_R1_001_13.list
Error: Could not make wordmap from file K-mer_lists/RD-1806-6-75-2-pin_S19_L007_R1_001_13.list!

A brief description of the steps I ran on my fastq files beforehand: I joined the forward and reverse reads of each sample using fastq-join, but it did not work. Then I read that it could be a formatting issue, so I converted the files to fasta using seqtk seq.

Any idea on what could be causing it? Thanks in advance!

erkiaun commented 5 years ago

Hello!

I was unable to reproduce the exact same error. I only got the "Error: Could not make wordmap from file K-mer_lists/..." lines when the input genome files were missing, faulty, or empty, or when the paths in the input file were incorrect.

It seems that in your case, too, the "RD-1806-*.list" files are for some reason not generated correctly. I recommend double-checking the input genome files (e.g. they can't be .gz files) and the paths in the input pheno file.
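The checks suggested above (file present, not empty, not gzip-compressed) can be automated. The following is only a minimal sketch, not part of PhenotypeSeeker: the function name check_pheno_inputs is hypothetical, and it assumes the pheno file has one header line followed by whitespace-separated columns with the sample ID first and the file address second, and that addresses contain no spaces.

```python
import glob
import os


def check_pheno_inputs(pheno_path, has_header=True):
    """Return a list of (sample_id, problem) tuples for suspicious input paths.

    Assumed layout (not confirmed by the thread): one header line, then
    whitespace-separated columns: sample ID, file address, phenotype value(s).
    """
    problems = []
    with open(pheno_path) as handle:
        lines = handle.read().splitlines()
    if has_header:
        lines = lines[1:]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) < 3:
            problems.append((fields[0], "expected at least 3 columns"))
            continue
        sample_id, address = fields[0], fields[1]
        # Expand "~" and shell-style wildcards such as *.fastq.
        matches = glob.glob(os.path.expanduser(address))
        if not matches:
            problems.append((sample_id, "no file matches " + address))
        for path in matches:
            if path.endswith(".gz"):
                problems.append((sample_id, path + " is gzip-compressed"))
            elif os.path.getsize(path) == 0:
                problems.append((sample_id, path + " is empty"))
    return problems
```

Calling check_pheno_inputs("data.pheno") and printing the result gives one line per suspicious sample; an empty list means none of these three problems was found, not that the files are valid FASTA/FASTQ.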

Also, could you tell me whether the "RD-1806-*.list" files exist in the "K-mer_lists/" directory and, if they do, what their sizes are? Please also check whether the example script "example/test_PS_modelig.sh" works :)
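One way to collect the file names and sizes asked for here is sketched below. The helper name list_sizes is hypothetical; the directory name "K-mer_lists" and the ".list" suffix are taken from the error messages earlier in the thread.

```python
import os


def list_sizes(list_dir="K-mer_lists", suffix=".list"):
    """Return {file name: size in bytes} for the k-mer list files in list_dir.

    An empty dict means the directory is missing or holds no matching files.
    """
    if not os.path.isdir(list_dir):
        return {}
    return {
        name: os.path.getsize(os.path.join(list_dir, name))
        for name in sorted(os.listdir(list_dir))
        if name.endswith(suffix)
    }
```

Printing the returned dictionary from the directory where the pipeline was started shows at a glance whether any list file is missing or has a size of 0 bytes.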

You could also check whether the GenomeTester4 programs work directly on your input genomes. For example:

glistmaker "genome1..." -o genome1
glistmaker "genome1 genome2 genome3 ..." -o feature_vector
glistquery genome1_16.list -l feature_vector_13.list

The last command is the one throwing the error in the PhenotypeSeeker workflow, because genome1_16.list is not generated correctly when there is a problem with the genome1 file or its path.
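To find out which of the three steps fails first, the commands can be run one after another from a small script. This is only a sketch under stated assumptions: the function names are hypothetical, the only flags used are the ones quoted in this comment (-o and -l), the list file names are passed in by the caller because their numeric suffix depends on the k-mer length, and it is assumed (not confirmed here) that the GenomeTester4 tools signal failure through a non-zero exit status.

```python
import shutil
import subprocess


def build_commands(genomes, sample_list, feature_list):
    """Build the three commands from the comment above as argument lists.

    genomes: paths to the input genome files (the first one is "genome1").
    sample_list / feature_list: the .list files expected from steps 1 and 2.
    """
    return [
        ["glistmaker", genomes[0], "-o", "genome1"],
        ["glistmaker", *genomes, "-o", "feature_vector"],
        ["glistquery", sample_list, "-l", feature_list],
    ]


def run_until_failure(commands):
    """Run commands in order; return the first one that fails, or None."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            return cmd  # program is not on PATH
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Assumption: a non-zero exit status means the step failed.
            print(result.stderr.strip())
            return cmd
    return None
```

If run_until_failure returns the first command, the problem lies with the genome file or with the glistmaker installation rather than with the mapping step.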

juanitagutierrez commented 5 years ago

Hello, and thanks for your help!

I am running it on files that resulted from joining forward and reverse reads from each sample using fastq-join from qiime2. They look like any regular fastq file and I have used them in other programs that also require fasta or fastq formats. Do you have any suggestion on the best strategy when starting with paired-end raw reads? The data.pheno file is ok (i.e. files' location is correct).

Indeed, the problem starts with the generation of the lists. It starts "Generating the k-mer lists for input samples", but I can't find them in the K-mer_lists folder. Only *mapped.txt files are created but they are empty.

I have tried running the example as you suggested, but it fails to find phenotypeseeker. The location of the cloned repository is specified in my bash profile.

erkiaun commented 5 years ago

Hi!

PhenotypeSeeker can actually take multiple fastq files per sample if they are specified using a wildcard. For example: "sample1 ~/sample1/*.fastq 0; sample2 ~/sample2/*.fastq 1". If you can match both paired-end fastq files with a single wildcard address, I recommend trying this approach. Also, please reinstall PhenotypeSeeker before trying it: I very recently fixed some bugs that arose when using wildcards in addresses.