bowmanjeffs / paprica

paprica - PAthway PRediction by phylogenetIC plAcement

paprica_build.sh unknown error #19

Closed by karoraw1 8 years ago

karoraw1 commented 8 years ago

My recent attempt to run paprica_build.sh failed after 8ish hours with the error shown below. Prior to the failed attempt, I ran paprica_build.sh and made it through the paprica_make_ref.py portion before the computation time limit kicked in at 12 hours (d'oh). I am trying it again now in a freshly cloned copy (with a 99 hour time limit... ) and if the same error pops up, I'll dig a bit deeper into the problem.

Traceback (most recent call last):
  File "paprica_make_ref_v0.20.py", line 447, in <module>
    rx = dist_cv_norm.loc[r1, r2]
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 1187, in __getitem__
    return self._getitem_tuple(key)
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 700, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 850, in _getitem_lowerdim
    return getattr(section, self.name)[new_key]
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 1189, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 1333, in _getitem_axis
    self._has_valid_type(key, axis)
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 1295, in _has_valid_type
    error()
  File "/cm/shared/apps/Intel/python/2.7.10/lib/python2.7/site-packages/pandas/core/indexing.py", line 1282, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [GCF_001020955.1] is not in the [index]'

bowmanjeffs commented 8 years ago

Can you confirm things are working after the taxtastic install? Thanks for the heads-up on dependencies, I've made that change in the manual. Let me know if you're still having issues - beware that I'm off the grid (on research vessel) for the next five days so replies will be delayed.

bowmanjeffs commented 8 years ago

After thinking about this for a moment longer I realized that it is unlikely to be related to your taxtastic install. Do you have an ftp site that you can post your 5mer_compositional_vectors.dist file to (should be located in ref_genome_database directory)?

karoraw1 commented 8 years ago

I posted it on my Pages site, if that works:

http://karoraw1.github.io/assets/5mer_compositional_vectors.dist2.bz2

karoraw1 commented 8 years ago

It failed with the same error after 3 hours of computation instead of 8 in the newly cloned repo.

bowmanjeffs commented 8 years ago

Okay, back on dry land but still traveling for the next couple of days. Got the .dist file and I'll take a look. In the meantime, to get you moving forward, have you tried using the paprica_run.sh script and the provided database?

bowmanjeffs commented 8 years ago

I'm mystified by this... 'GCF_001020955.1' is in the index of the file you sent:

import pandas as pd
test_df = pd.read_csv('5mer_compositional_vectors.dist2', index_col = 0)
test_df.loc['GCF_001020955.1', 'GCF_001020955.1']

...returns 0.0, as it should. I can't complete a full build of the database from here to try to reproduce the error, but I will at the first chance. In the meantime, can you try building the database after commenting out lines 440-450 (if you're comfortable doing that; let me know if not and I'll provide you a file)? This eliminates the selection of a random subset for curve fitting, a step that is not strictly necessary. If the script completes, it will narrow down the possibilities. Note that for these subsequent runs you should have download = False on line 30. The script will still take some time to run, but at least you won't need to re-download all the genomes.
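
A more general version of that check might look like the sketch below. It is only an illustration written for Python 3: the function name missing_from_matrix and the toy accessions are made up here and are not part of paprica. It reports every label that is absent from the rows or the columns of a square distance matrix, which is the condition that makes .loc raise the KeyError in the traceback.

```python
import pandas as pd


def missing_from_matrix(dist, labels):
    """Return the labels absent from the rows or columns of a distance matrix.

    Hypothetical helper for illustration; not a paprica function.
    """
    rows = set(dist.index)
    cols = set(dist.columns)
    return sorted(label for label in labels if label not in rows or label not in cols)


# Toy 2 x 2 distance matrix with made-up accessions.
toy = pd.DataFrame(
    [[0.0, 0.3], [0.3, 0.0]],
    index=["GCF_A", "GCF_B"],
    columns=["GCF_A", "GCF_B"],
)

# GCF_C was never added to the matrix, so it is reported as missing.
missing = missing_from_matrix(toy, ["GCF_A", "GCF_C"])
print(missing)
```

Running something like this against the real .dist file, with the genome list used by the build, would show every affected genome at once rather than only the first one that triggers the error.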

karoraw1 commented 8 years ago

I will give both a shot today and see what happens

bowmanjeffs commented 8 years ago

Okay, I was able to replicate the error. The problem happens when wget fails to download the faa file from a genome's directory on Genbank. In that case paprica finds the 16S gene and builds the 16S rRNA distance matrix but fails to build the compositional vector distance matrix. I'm working on a fix now. Note that this explanation is incompatible with my previous comment using "test_df"; however, I'm hoping that is the result of you trying paprica_build.sh multiple times.

karoraw1 commented 8 years ago

Yep, commenting out those lines did not help. I checked to see if the refseq directory was the same size across the different times I downloaded everything. It is off by about 66 Mb.

I would have noticed the difference in sizes earlier if I used du -s instead of du -hs... :cry:

How does it make a listing for a genome in a new clone of the repository without fully pulling all the necessary files?

Thanks for the help. I really appreciate it.

bowmanjeffs commented 8 years ago

No prob, thanks for finding the error! I think I've fixed it and am trying to build the database now to see if it completes without error.

Here's how paprica handles the downloads: it first downloads a csv file containing summary information for all the genomes in Genbank, including the ftp path for each genome's files. This allows tight control over things like sequencing status. The script then loops over the ftp path column in the summary file dataframe, calling wget for each path. Wget is set to try each download 30 times; however, if some ftp glitch prevents the download after 30 tries, it simply moves on. Since the downstream processes all use the same summary file dataframe, they go looking for every genome it contains, including any whose download failed.

The fix checks that each download actually happened and eliminates the genome from the summary file if it did not. For a future release I'll try to build in a more sophisticated mechanism for checking and retrying the downloads so as to not lose those genomes. If the fix works I'll post it as soon as the test is complete (which as you know takes a bit...).
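
The idea behind the fix can be sketched roughly as follows. This is not the code from paprica_make_ref; it is a minimal Python 3 illustration that assumes each genome has its own directory named after its assembly accession and that a successful download leaves one file ending in .fna and one ending in .faa there. The function name drop_failed_downloads, the column name ftp_path, and the accessions are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def drop_failed_downloads(summary, ref_dir, suffixes=(".fna", ".faa")):
    """Keep only the genomes whose directory holds a file for every suffix.

    Hypothetical helper; the directory layout and suffixes are assumptions.
    """
    keep = []
    for assembly in summary.index:
        genome_dir = os.path.join(ref_dir, assembly)
        files = os.listdir(genome_dir) if os.path.isdir(genome_dir) else []
        keep.append(all(any(f.endswith(s) for f in files) for s in suffixes))
    # A boolean list of the same length as the index selects the rows to keep.
    return summary.loc[keep]


# Toy demonstration: one complete genome, one missing its faa file,
# and one whose directory was never created.
with tempfile.TemporaryDirectory() as ref_dir:
    layout = {"GCF_ok": ["g.fna", "g.faa"], "GCF_bad": ["g.fna"]}
    for assembly, names in layout.items():
        os.makedirs(os.path.join(ref_dir, assembly))
        for name in names:
            open(os.path.join(ref_dir, assembly, name), "w").close()

    summary = pd.DataFrame(
        {"ftp_path": ["ftp://example.invalid/ok",
                      "ftp://example.invalid/bad",
                      "ftp://example.invalid/none"]},
        index=["GCF_ok", "GCF_bad", "GCF_missing"],
    )
    cleaned = drop_failed_downloads(summary, ref_dir)

print(list(cleaned.index))
```

Because every downstream step reads the same dataframe, filtering it once right after the download loop keeps the later distance-matrix steps from looking up genomes that have no files on disk.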

bowmanjeffs commented 8 years ago

Okay, test worked. I've attached an updated version of the file here. I want to finish testing the whole pipeline before uploading the new version to the repository. paprica_make_ref_v0.21.zip

bowmanjeffs commented 8 years ago

Issue persists. The script was iterating across directories to find 16S rRNA gene sequences, so the previous fix had no effect. Changed it to iterate across summary_complete.index. This should restrict the 16S rRNA gene search to only those genomes that had successful downloads of fna and faa.
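
A toy illustration of the difference, with made-up accessions (only the name summary_complete comes from the script; the n16S column and the directory listing are hypothetical): a directory can exist on disk even when its download was incomplete, so looping over directories picks up a genome that the filtered summary no longer contains.

```python
import pandas as pd

# Hypothetical directory listing: GCF_2 has a directory on disk,
# but its download was incomplete.
genome_dirs = ["GCF_1", "GCF_2", "GCF_3"]

# Hypothetical filtered summary: GCF_2 has already been dropped.
summary_complete = pd.DataFrame({"n16S": [1, 2]}, index=["GCF_1", "GCF_3"])

# Old behaviour: every directory is searched, including GCF_2.
by_directory = list(genome_dirs)

# New behaviour: only genomes still listed in the summary are searched.
by_summary = list(summary_complete.index)

# Genomes the old loop would have processed despite the failed download.
skipped = sorted(set(by_directory) - set(by_summary))
print(skipped)
```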

bowmanjeffs commented 8 years ago

Okay... think I got it this time. This created a new issue downstream, so paprica_build_core_genomes is currently producing blank dataframes, but the current version of paprica_make_ref should allow you to finish testing that part of paprica_build.sh. Should hopefully have everything operational COB tomorrow.