bioinfo-ut / PhenotypeSeeker

Identify phenotype-specific k-mers and predict phenotype using sequenced bacterial strains
GNU General Public License v3.0

Failure to make wordmap during modeling #2

Closed by Mishmash-su 5 years ago

Mishmash-su commented 6 years ago

This happens with both example data and my own data.

Running it on Ubuntu 14.04.4 LTS

phenotypeseeker modeling dtoxin-phenotypeseeker.pheno 
Compiled module "_likelihood_tree" not found.  Will use pure Python/NumPy likelihoodihood tree.
Compiled module "_pairwise_pogs" not found.  Will use slow Python alignment implementation.
Compiled module "_pairwise_seqs" not found.  Will use slow Python alignment implementation.
Compiled module "_compare" not found.  Will use slow Python dotplot.
Generating the k-mer lists for input samples:
    107 of 107 lists generated.
Generating the k-mer feature vector.
Mapping samples to the feature vector space:
Error: Could not make wordmap from file K-mer_lists/ID_13.list!
    107 of 107 samples mapped.
Estimating the Mash distances between samples...
Writing reference.msh...Traceback (most recent call last):
  File "/usr/local/bin/phenotypeseeker", line 280, in <module>
    Main()
  File "/usr/local/bin/phenotypeseeker", line 274, in Main
    args.func(args)
  File "/data1/home/msu/.local/lib/python2.7/site-packages/PhenotypeSeeker/modeling.py", line 1430, in modeling
    Samples.get_weights()
  File "/data1/home/msu/.local/lib/python2.7/site-packages/PhenotypeSeeker/modeling.py", line 288, in get_weights
    cls._distance_matrix_to_phyloxml(Input.samples.keys(), dist_mat)   
  File "/data1/home/msu/.local/lib/python2.7/site-packages/PhenotypeSeeker/modeling.py", line 340, in _distance_matrix_to_phyloxml
    dm = _DistanceMatrix(samples_order, distance_matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 306, in __init__
    _Matrix.__init__(self, names, matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 119, in __init__
    "'names' and 'matrix' should be the same size")
ValueError: 'names' and 'matrix' should be the same size

eacton commented 6 years ago

Hi! I reinstalled PhenotypeSeeker an hour ago after seeing this thread, and I am getting a similar error when making the wordmaps. Any suggestions? I ran it on my own data with the following command:

phenotypeseeker modeling sample_sheet_assembled.tsv -l 13 -c 1 --min 2 --pvalue 0.05 -B --weights + --assembly + --num_threads 12

Further note: I also tried this after removing the 4 offending samples. The individual wordmap failures for those samples went away, but I still got 'Error: Could not make wordmap from file K-mer_lists/samples_13.list!' and all of the errors from the ERROR: "reference.msh" line onward.

Generating the k-mer lists for input samples:
    174 of 174 lists generated.
Generating the k-mer feature vector.
Mapping samples to the feature vector space:
Error: Could not make wordmap from file K-mer_lists/samples_13.list!
    8 of 174 samples mapped.
Error: Could not make wordmap from file K-mer_lists/N10_13.list!
    124 of 174 samples mapped.
Error: Could not make wordmap from file K-mer_lists/N234_13.list!
    133 of 174 samples mapped.
Error: Could not make wordmap from file K-mer_lists/N49_13.list!
    165 of 174 samples mapped.
Error: Could not make wordmap from file K-mer_lists/N79_13.list!
    174 of 174 samples mapped.
Estimating the Mash distances between samples...
ERROR: "reference.msh" exists; remove to write.
Traceback (most recent call last):
  File "/usr/local/bin/phenotypeseeker", line 280, in <module>
    Main()
  File "/usr/local/bin/phenotypeseeker", line 274, in Main
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 1421, in modeling
    Samples.get_weights()
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 286, in get_weights
    cls._distance_matrix_to_phyloxml(Input.samples.keys(), dist_mat)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 338, in _distance_matrix_to_phyloxml
    dm = _DistanceMatrix(samples_order, distance_matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 311, in __init__
    _Matrix.__init__(self, names, matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 122, in __init__
    "'names' and 'matrix' should be the same size")
ValueError: 'names' and 'matrix' should be the same size

erkiaun commented 5 years ago

I improved the input file reading part of PhenotypeSeeker. It now accepts a more flexible format for the header row, which should resolve these errors.

eacton commented 5 years ago

Thanks @erkiaun! Works like a charm now.

eacton commented 5 years ago

Hi @erkiaun - I definitely got farther, but now I get the error "global name 'cls' is not defined". I did not see a 'cls' parameter in the help menu, and the only related argument I pass is --pvalue 0.05 (same command as above). Any suggestions?

Filtering the k-mers by p-value:
Traceback (most recent call last):
  File "/usr/local/bin/phenotypeseeker", line 280, in <module>
    Main()
  File "/usr/local/bin/phenotypeseeker", line 274, in Main
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 1334, in modeling
    Input.phenotypes_to_analyse.values()
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 1333, in <lambda>
    lambda x: x.get_kmers_filtered(),
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 766, in get_kmers_filtered
    self.get_pvalue_cutoff(pvalues, nr_of_kmers_tested)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 802, in get_pvalue_cutoff
    self.pvalue_cutoff = (cls.pvalue_cutoff/nr_of_kmers_tested)
NameError: global name 'cls' is not defined

erkiaun commented 5 years ago

Hi @eacton! A small bug had been introduced in the part that implements the Bonferroni correction. I have fixed it. I hope you can now finish your analysis smoothly :)
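For anyone hitting the same NameError: inside an instance method, the name cls is not bound unless it is passed in explicitly, so a class-level attribute has to be reached through self or through the class. The sketch below is not PhenotypeSeeker's actual code. The class name Phenotype and the method body are hypothetical; only the Bonferroni rule itself (dividing the significance level by the number of tests) is standard.

```python
class Phenotype:
    # Class-level significance level shared by all instances (hypothetical name).
    pvalue_cutoff = 0.05

    def get_pvalue_cutoff(self, nr_of_kmers_tested):
        """Return the Bonferroni-corrected cutoff for the given number of tests."""
        if nr_of_kmers_tested <= 0:
            raise ValueError("nr_of_kmers_tested must be positive")
        # type(self) reaches the class attribute; a bare `cls` here would
        # raise NameError because this is an instance method.
        return type(self).pvalue_cutoff / nr_of_kmers_tested


corrected = Phenotype().get_pvalue_cutoff(1000)
print(corrected)
```

Returning the corrected value instead of overwriting the class attribute keeps the original significance level available for the next phenotype.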

eacton commented 5 years ago

Hi @erkiaun! Thanks for getting back so quickly. I ran into another error (below), which I did not get yesterday with the previous version. It looks like 'reference.msh' cannot be opened for reading because it is no longer being produced. Any chance your update affected this?

Estimating the Mash distances between samples...
ERROR: could not open "K-mer_lists/*.msh" for reading.
ERROR: could not open "reference.msh" for reading.
Traceback (most recent call last):
  File "/usr/local/bin/phenotypeseeker", line 280, in <module>
    Main()
  File "/usr/local/bin/phenotypeseeker", line 274, in Main
    args.func(args)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 1323, in modeling
    Samples.get_weights()
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 285, in get_weights
    cls._distance_matrix_to_phyloxml(Input.samples.keys(), dist_mat)
  File "/usr/local/lib/python2.7/dist-packages/PhenotypeSeeker/modeling.py", line 337, in _distance_matrix_to_phyloxml
    dm = _DistanceMatrix(samples_order, distance_matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 311, in __init__
    _Matrix.__init__(self, names, matrix)
  File "/usr/local/lib/python2.7/dist-packages/Bio/Phylo/TreeConstruction.py", line 122, in __init__
    "'names' and 'matrix' should be the same size")
ValueError: 'names' and 'matrix' should be the same size

erkiaun commented 5 years ago

Hi @eacton! This is a somewhat strange error, and I don't see how the last update could have caused it. It should work as it did yesterday, and I cannot reproduce the error. Could you check the K-mer_lists folder that PhenotypeSeeker created to see whether the *.msh files are actually there?
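A quick way to do that check for many samples is a short script. This is only a sketch: it assumes one sketch file per sample named <sample>.msh inside K-mer_lists, which is a guess based on the error message above, and the function name is made up for illustration.

```python
import os


def samples_missing_sketches(sample_names, sketch_dir="K-mer_lists"):
    """Return the sample names that have no <name>.msh file in sketch_dir."""
    missing = []
    for name in sample_names:
        # Assumed naming scheme: one Mash sketch per sample, called <name>.msh.
        sketch_path = os.path.join(sketch_dir, name + ".msh")
        if not os.path.isfile(sketch_path):
            missing.append(name)
    return missing


# Example call with two sample IDs from this thread.
print(samples_missing_sketches(["N1", "N10"]))
```

If every sample is reported as missing, the sketching step itself failed; if only a few are, the problem is more likely in those particular input files.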

eacton commented 5 years ago

Hi @erkiaun - thanks. No, the .msh files were not generated... the feature_vector.lists, the mapped.txt files and the .list files were generated in the K-mer_lists folder. It's just weird because I didn't change anything else and I got as far as the chi-square tests yesterday... Any idea why the .msh files are not being generated?

erkiaun commented 5 years ago

Hi @eacton! I don't know exactly what is causing this. For some reason it failed to write the Mash sketch files (.msh) into the K-mer_lists directory. Did you change anything in the input file? Slashes ("/") in the sample names could cause this error, because in some constructs part of the sample name is then interpreted as a Unix path. If that is not the case, I have also written a simple test script that you could run ("test_sketching.py phenotypeseeker-inputfile.tsv") to check whether it manages to create the .msh files in "K-mer_lists" or fails as well. (I had to add the ".txt" extension to the script file to post it here.)

test_sketching.py.txt
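The slash problem can be screened for before running the pipeline. The snippet below is a minimal sketch with a hypothetical helper name; it only flags sample IDs that contain a path separator and does not reflect how PhenotypeSeeker itself validates input.

```python
import os


def unsafe_sample_names(sample_names):
    """Return the names that contain a path separator."""
    # A name such as "run2/N10" would turn "K-mer_lists/<name>.msh" into a
    # path inside a subdirectory that does not exist.
    return [name for name in sample_names if "/" in name or os.sep in name]


flagged = unsafe_sample_names(["N1", "run2/N10", "N234"])
print(flagged)
```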

eacton commented 5 years ago

Hey @erkiaun - thanks for the help! I ran your script and got the following:

ERROR: could not open "K-mer_lists/*.msh" for reading.
ERROR: could not open "reference.msh" for reading.

The first couple of lines of my sample input file are as follows:

SampleID Assembly mortality_30day mortality_7day nosocomial
N1 1_careful_output_evalue_filtered.fasta 0 0 1
N10 10_181009_evalue_filtered.fasta 1 0 1

I do not have any slashes in my filenames. All other names are similar with only numbers changed.

erkiaun commented 5 years ago

Hi @eacton! It's my pleasure to solve all the mysterious errors related to PhenotypeSeeker! I thank you for your patience!

Actually, the script I sent hid the original errors from the screen. I have attached another script that prints out all the errors. It seems that piping your FASTA files to the mash program does not work correctly. You could also test this directly with the simple command "cat sample.fasta | mash sketch - -o samplename".

test_sketching_v2.py.txt

Mishmash-su commented 5 years ago

My run finishes fine now, or at least all the files are created. It does throw an error at the end, though.

phenotypeseeker modeling PS_modeling_example_files/data.pheno 
Compiled module "_likelihood_tree" not found.  Will use pure Python/NumPy likelihoodihood tree.
Compiled module "_pairwise_pogs" not found.  Will use slow Python alignment implementation.
Compiled module "_pairwise_seqs" not found.  Will use slow Python alignment implementation.
Compiled module "_compare" not found.  Will use slow Python dotplot.
Generating the k-mer lists for input samples:
    30 of 30 lists generated.
Generating the k-mer feature vector.
Mapping samples to the feature vector space:
    30 of 30 samples mapped.
Estimating the Mash distances between samples...
Calculating the Gerstein Sonnhammer Coathia weights from mash distance matrix...
Conducting the k-mer specific chi-square tests:
    100% of tests conducted.
Filtering the k-mers by p-value:
    100% of k-mers filtered.
Generating the logistic regression model...
Traceback (most recent call last):
  File "/usr/local/bin/phenotypeseeker", line 280, in <module>
    Main()
  File "/usr/local/bin/phenotypeseeker", line 274, in Main
    args.func(args)
  File "/data1/home/msu/.local/lib/python2.7/site-packages/PhenotypeSeeker/modeling.py", line 1450, in modeling
    assembling(kmers_passed_all_phenotypes, args.mpheno)

eacton commented 5 years ago

Hey @erkiaun - my errors from the second python script:

ERROR: Did not find fasta records in "-".
cat: write error: Broken pipe
Sketching from stdin...

And the 2nd try:

cat 1_careful_output_evalue_filtered.fasta | mash sketch - -o ./N1

Sketching from stdin...
ERROR: Did not find fasta records in "-".

And yes, the FASTA files are in the same working directory as the sample sheet...

erkiaun commented 5 years ago

Hi @eacton! I rewrote this part of the code to do the same thing without piping. I hope the latest version on GitHub no longer throws this error :)
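For reference, sketching without a pipe amounts to passing the FASTA path to mash as a regular argument instead of "-". The following is a minimal sketch of that idea, not the code that went into PhenotypeSeeker; both function names are hypothetical, and it assumes Python 3 with mash available on the PATH.

```python
import subprocess


def build_sketch_command(fasta_path, out_prefix):
    """Build a `mash sketch` call that reads the FASTA file directly."""
    # -o sets the output prefix; mash adds the .msh extension itself.
    return ["mash", "sketch", "-o", out_prefix, fasta_path]


def sketch_sample(fasta_path, out_prefix):
    """Run mash on one sample and fail loudly if sketching does not succeed."""
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failed sketch is not silently skipped.
    subprocess.run(build_sketch_command(fasta_path, out_prefix), check=True)


print(build_sketch_command("1_careful_output_evalue_filtered.fasta", "K-mer_lists/N1"))
```

On Python 2.7, as used in this thread, subprocess.check_call accepts the same argument list and also raises on a non-zero exit status.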

erkiaun commented 5 years ago

@ViroMSu This is a familiar error, but I fixed it some time ago. The latest version on GitHub shouldn't throw it. After generating the model, PhenotypeSeeker tries to assemble the k-mers used in the model into contiguous sequences that are as long as possible. The error was related to this assembly step.

eacton commented 5 years ago

Hi @erkiaun! Everything worked! Thanks for your help. I'm looking forward to checking out the output!