broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.
MIT License

Having trouble accessing preloaded datasets #26

Closed lkothera closed 5 years ago

lkothera commented 5 years ago

Hi, novice Linux user here.

I work for the CDC and our scientific computing people have installed CATCH on our biolinux platform. I have loaded CATCH and was trying to run the line of code to have the program make probes for the installed Zika virus data set. I'm getting error messages that seem to say the .gz file can't be found, although if I move around the directories, I can see the .gz file that is supposed to be used to generate the probe designs.

Here is the line of code and the error messages:

fph6@biolinux> design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose
2019-03-26 15:09:11,298 - catch.utils.seq_io [INFO] Reading fasta file /apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz
Traceback (most recent call last):
  File "/apps/x86_64/catch/catch/bin/design.py", line 811, in <module>
    main(args)
  File "/apps/x86_64/catch/catch/bin/design.py", line 60, in main
    genomes_grouped += [seq_io.read_dataset_genomes(dataset)]
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 71, in read_dataset_genomes
    seqs = list(read_fasta(fn).values())
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 152, in read_fasta
    with gzip.open(fn, 'rt') as f:
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz'

Can you help? Thanks, Linda

haydenm commented 5 years ago

Hi Linda,

I'm sorry to hear about the issue. I haven't seen this before, and it's not obvious to me what the cause of the problem is if you're able to see that the .gz file is there. It may have something to do with how CATCH was installed on your platform.

Can you start by running ls -l /apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz and pasting the results, so I can see the file size? If the size is small, it may consist of only the hash and suggest the data has not been pulled via git lfs pull, although I'm not sure if this would yield the FileNotFoundError.
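The same check can be scripted. The sketch below is only an illustration and not part of CATCH: the function name inspect_dataset_file is hypothetical. It relies on two facts: a Git LFS pointer is a small text file whose first line is "version https://git-lfs.github.com/spec/v1", and a real gzip file starts with the bytes 1f 8b.

```python
import os

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"
# Magic number at the start of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def inspect_dataset_file(path):
    """Return (kind, size_in_bytes) for a dataset file.

    kind is one of 'missing', 'lfs-pointer', 'gzip', or 'unknown'.
    The function name and return shape are hypothetical, for illustration.
    """
    if not os.path.isfile(path):
        return ("missing", 0)
    size = os.path.getsize(path)
    # Reading the first few bytes is enough to tell a pointer from real data.
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head.startswith(LFS_POINTER_PREFIX):
        return ("lfs-pointer", size)
    if head.startswith(GZIP_MAGIC):
        return ("gzip", size)
    return ("unknown", size)
```

Calling inspect_dataset_file on the zika.fasta.gz path from the traceback would separate the three cases discussed here: the file is absent at that exact path, it is only an LFS pointer, or it is real gzip data.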

lkothera commented 5 years ago

Hi Hayden, Thanks for getting back to me. I am unable to get that line of code you pasted below to work. I typed it a couple of times and got “No such file or directory”. The way I think I can see the file size for the zika.fasta.gz file was to type

ls -l /apps/x86_64/catch/catch/catch/datasets/data

I did the cd command along the way. Not sure if that matters.

And the result line for Zika is

-rwxr-xr-x. 1 root root 764974 2019-03-26 13:53 zika.fasta.gz

There seems to be a lot to the path that the program wants to take. Let me know please if it’s something we need to fix on our end.

Thanks, Linda

haydenm commented 5 years ago

I'm not certain, but based on the path you provided (containing an egg) it looks like CATCH may have been installed by your team using easy_install, which I haven't used or tested. As noted in the README, I'd recommend pip -- in particular (but optionally), from within a virtual environment. Installing via conda is another option.

It looks like the design.py on your PATH is in a different directory than where the data lives. One quick fix might be to try running python /apps/x86_64/catch/bin/design.py zika ..., instead of design.py zika .... Can you let me know if that works?
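To see which copy of the package Python actually imports, and whether the data sits next to it, something like the following could help. It is a sketch under assumptions: the helper names are hypothetical, and the relative path datasets/data/zika.fasta.gz is taken from the traceback above rather than from any documented API.

```python
import importlib.util
import os
import pathlib


def path_is_inside_egg(path):
    """True if any component of the path ends with '.egg' (hypothetical helper)."""
    return any(part.endswith(".egg") for part in pathlib.PurePath(path).parts)


def locate_package_data(package, relative_path):
    """Return (package_dir, data_path, exists, inside_egg) for a bundled data file.

    Looks the package up the same way 'import' would, without importing it.
    Returns (None, None, False, False) if the package cannot be found.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return (None, None, False, False)
    package_dir = os.path.dirname(os.path.abspath(spec.origin))
    data_path = os.path.join(package_dir, relative_path)
    return (
        package_dir,
        data_path,
        os.path.exists(data_path),
        path_is_inside_egg(package_dir),
    )
```

For this thread, the call would be locate_package_data("catch", os.path.join("datasets", "data", "zika.fasta.gz")). A package directory under site-packages inside an .egg, with the data file reported as missing, would match the symptoms in the traceback.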

lkothera commented 5 years ago

That does not seem to work. Here’s the code and the error:

fph6@biolinux> python /apps/x86_64/catch/bin/design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose
python: can't open file '/apps/x86_64/catch/bin/design.py': [Errno 2] No such file or directory

I then did this (added an extra /catch to the line):

fph6@biolinux> python /apps/x86_64/catch/catch/bin/ design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose

Which returned this after three lines of other output:

/apps/x86_64/python/3.6.1/bin/python: can't find '__main__' module in '/apps/x86_64/catch/catch/bin/'

I looked around for __main__ and can’t find it. Here’s what’s in '/apps/x86_64/catch/catch/bin/':

fph6@biolinux> cd /apps/x86_64/catch/catch/bin
fph6@biolinux> ls
analyze_probe_coverage.py design_naively.py design.py pool.py

Linda

haydenm commented 5 years ago

There's a space in python /apps/x86_64/catch/catch/bin/ design.py between bin/ and design.py. Does it work if you run it without that space?

lkothera commented 5 years ago

It does not seem to work when I do that.

fph6@biolinux> python /apps/x86_64/catch/catch/bin/ design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose
/apps/x86_64/python/3.6.1/bin/python: can't find '__main__' module in '/apps/x86_64/catch/catch/bin/'

There is a space between python and /apps and another between bin/ and design.py.

Does it matter what directory I’m in?

lkothera commented 5 years ago

Geez. I misread your email. Hang on.

lkothera commented 5 years ago

OK, here’s the code and results without the space between bin/ and design.py

fph6@biolinux> python /apps/x86_64/catch/catch/bin/design.py zika -pl 75 -m 2 -l 60 -e 50 -o zika-probes.fasta --verbose
2019-03-27 16:10:50,320 - catch.utils.seq_io [INFO] Reading fasta file /apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz
Traceback (most recent call last):
  File "/apps/x86_64/catch/catch/bin/design.py", line 811, in <module>
    main(args)
  File "/apps/x86_64/catch/catch/bin/design.py", line 60, in main
    genomes_grouped += [seq_io.read_dataset_genomes(dataset)]
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 71, in read_dataset_genomes
    seqs = list(read_fasta(fn).values())
  File "/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/utils/seq_io.py", line 152, in read_fasta
    with gzip.open(fn, 'rt') as f:
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 53, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/apps/x86_64/python/3.6.1/lib/python3.6/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/apps/x86_64/python/3.6.1/lib/python3.6/site-packages/catch-v1.2.0_20_gbf97305_dirty-py3.6.egg/catch/datasets/data/zika.fasta.gz'

haydenm commented 5 years ago

Unfortunately, I think this is going to be tough to resolve given how it was installed. As I mentioned earlier, because of the egg file in the site-packages directory, I suspect that CATCH was installed using Distutils (python setup.py install) or with easy_install. I have not tested it this way, and can't recommend it. The basic problem, when installing this way, is that the installation is copying Python files into the egg file, but not the data -- and consequently the Python modules are unable to locate the data, which would normally be in the same directory structure. These installation methods should be fine if you do not plan to use the data distributed with CATCH, so you could alternatively move on to just use your own input FASTA files.
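The failure mode can be reproduced in miniature. The sketch below is a toy model and not how setuptools or CATCH actually work internally: the directory layout, the dataset_path helper, and the "install" step that copies only .py files are all assumptions made for illustration.

```python
import os
import shutil
import tempfile


def dataset_path(module_file, name):
    """Resolve a dataset relative to the module that loads it (assumed layout)."""
    return os.path.join(os.path.dirname(module_file), "data", name + ".fasta.gz")


def copy_python_files_only(src, dst):
    """Mimic an install step that copies *.py files and skips everything else."""
    for root, _dirs, files in os.walk(src):
        target = os.path.normpath(os.path.join(dst, os.path.relpath(root, src)))
        for fname in files:
            if fname.endswith(".py"):
                os.makedirs(target, exist_ok=True)
                shutil.copy(os.path.join(root, fname), os.path.join(target, fname))


def simulate_install():
    """Return (data_in_source, module_installed, data_in_install)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Source checkout: a module with its data stored alongside it.
        src = os.path.join(tmp, "source", "datasets")
        os.makedirs(os.path.join(src, "data"))
        module_src = os.path.join(src, "zika.py")
        open(module_src, "w").close()
        with open(dataset_path(module_src, "zika"), "wb") as f:
            f.write(b"placeholder")

        # "Installed" copy: code is copied, data is left behind.
        dst = os.path.join(tmp, "installed", "datasets")
        copy_python_files_only(src, dst)
        module_dst = os.path.join(dst, "zika.py")

        return (
            os.path.exists(dataset_path(module_src, "zika")),
            os.path.exists(module_dst),
            os.path.exists(dataset_path(module_dst, "zika")),
        )
```

In this toy model, simulate_install() reports that the data exists next to the source module and that the module was copied, but that the data path resolved from the installed module does not exist. That mirrors what was seen here: the .gz file is visible under /apps/x86_64/catch/catch/catch/datasets/data, while the egg in site-packages has no such file.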

I think this will be easiest to resolve by asking your compute team to reinstall CATCH using pip, as recommended in the README, with either pip install -e . or pip install --user -e . (the trailing dot is part of the command). Either way, the -e is needed to use the data distributed with the package. It would also be helpful if they could run the test suite, as described in the README, to verify that everything is working correctly.

lkothera commented 5 years ago

Yes, it sounds like that is what’s needed. Thank you again for the assistance.

Also, I have a couple of questions about wet lab work from your recent paper. Would someone be able to help me with some details of the steps and reagents involved between using the hybridization probes and using the MiSeq reagent kit? If so can I get the proper contact info?

Linda

haydenm commented 5 years ago

Yes, of course. Katie Siddle (kjsiddle@broadinstitute.org), my co-first author on the paper, is the right person to reach out to about those questions. Or you can email me (hayden@mit.edu) and I'll pass them along.

lkothera commented 5 years ago

Thank you!