NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

Taming Pathogens #46

Open bkatiemills opened 10 years ago

bkatiemills commented 10 years ago

Only very recently, an "adaptive" immune mechanism was discovered in bacteria, protecting them against the viruses that infect them (usually referred to as phages). It appears that bacteria keep a dynamic library of small pieces of phage genomes (spacers) to detect and neutralize phage attacks. This discovery may give some insight into strategies for combating the pathogens that threaten humans worldwide.

Simultaneously, tremendous databases of phage and bacteria genomes have been made openly available via web forms and APIs from NCBI and phagedb. There is a great need and opportunity to build a simple, automated pipeline to extract the relevant data from these databases and build specialized phage datasets to expedite research on understanding and controlling pathogens.
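As a rough illustration (not project code), one step of such a pipeline might pull a genome record through NCBI's E-utilities `efetch` endpoint. The sketch below builds the request URL and downloads the record as FASTA text; the accession shown is only a placeholder:

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, db="nucleotide"):
    """Build an E-utilities efetch URL requesting a FASTA record."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urllib.parse.urlencode(params)

def fetch_fasta(accession):
    """Download one genome record as FASTA text (network access required)."""
    with urllib.request.urlopen(efetch_url(accession)) as resp:
        return resp.read().decode()

# Example (placeholder accession):
# fasta = fetch_fasta("NC_001416")
```

A real pipeline would loop over a list of accessions and rate-limit its requests, per NCBI's usage guidelines.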

Sidhartha Goyal from the University of Toronto has described a few simple goals for getting started on this project - find them in the issue tracker in our repo.

mbjones commented 10 years ago

@BillMills OK, tagged your project, which looks really interesting. Maybe you could list/link some of the issues you might want to tackle during codefest in this bug so people have an idea what the products might be?

bkatiemills commented 10 years ago

@mbjones sure, see below; more detailed descriptions are in our issue tracker, but here's a quick overview.

sckott commented 10 years ago

Sounds interesting @BillMills - I do love working with data via web APIs, so I'm interested to see if I can be of help

bkatiemills commented 10 years ago

Awesome, @sckott - we've got examples of how the APIs more or less work, and test data too, so we should be able to hit the ground running on this one.

svaksha commented 9 years ago

@BillMills, how can folks who are not physically present at the codefest participate? Do you have a developers mailing list?

bkatiemills commented 9 years ago

Hi @svaksha,

Thanks for getting in touch! The easiest thing for a remote contributor to jump on would be to attack the issues in the tracker: https://github.com/BillMills/phageParser/issues

If you have questions or comments, feel free to open issues there too - I'll have my eye on it for the rest of the conference.

We'll be using our etherpad to host conversation and notes, too: https://etherpad.mozilla.org/OSCF-pathogens

There's nothing on the etherpad yet, but once our session starts we'll all be using that to take notes. No need to wait for us though, jump into that issue tracker and let me know if there's anything you need!

svaksha commented 9 years ago

Hi @BillMills,

Thanks for the reply. I've cloned the repo and am trying to understand the requirements. For bug #1, https://github.com/BillMills/phageParser/issues/1, should the function grep the entire blast-phagesdb.txt file for all the Expect values between 0 and 1 or do you want separate functions for each of the Sequences producing significant alignments?

Also, I had a query about copyright. An MIT license is fine, but the LICENSE file says 'Copyright (c) 2014 Bill Mills' <-- Will each developer have to sign away the copyright to the code they wrote? https://github.com/NCEAS/open-science-codefest/wiki/Pathogens states that you are organizing it on behalf of the Mozilla Science Lab, so why not assign copyright to the MSL, Mozilla Foundation, or some other org? This part isn't clear, so please clarify.

Thanks, -SVAKSHA ॥ http://about.me/svaksha

bkatiemills commented 9 years ago

Hi @svaksha ,

Good catch! I apologize for the confusion about the LICENSE copyright - that line was autogenerated by GitHub along with the license, and I forgot about it. In any case, MSL does not generally claim rights to code contributed to third-party projects, so I'll pass on assigning them that way; for now, I'm assigning the rights to the PI, pending further discussion.

Re: your first question, the former. We just want to scrape the entire file and process some of the information from each match into CSV. There's a stab at this now, as discussed in that issue, but it currently doesn't filter for match quality and needs testing and validation.
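For anyone picking this up, here is a minimal sketch of that kind of scraper. It assumes the report follows the usual plain-text BLAST layout, with alignment headers starting with '>' followed by lines containing 'Expect = ...'; the actual layout of blast-phagesdb.txt may differ:

```python
import csv
import re

# Matches the E-value in lines like "Score = 50 bits, Expect = 1e-05".
EXPECT_RE = re.compile(r"Expect\s*=\s*([0-9.eE+-]+)")

def parse_matches(lines, max_expect=1.0):
    """Yield (sequence_name, expect) pairs with 0 <= expect <= max_expect."""
    name = None
    for line in lines:
        if line.startswith(">"):
            # New alignment header; remember the sequence name.
            name = line[1:].strip()
        elif name:
            m = EXPECT_RE.search(line)
            if m:
                e = float(m.group(1))
                if 0 <= e <= max_expect:
                    yield name, e

def write_csv(matches, path):
    """Write the filtered matches out as a two-column CSV file."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["sequence", "expect"])
        w.writerows(matches)
```

This scans the whole file in one pass (rather than one function per alignment block) and keeps only matches whose E-value falls in [0, 1], per the filtering discussed above.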