Improving DragoNN integration

rbharath commented 6 years ago

Basic integration of DragoNN models with DeepChem contrib was just merged in #979. There's a good chunk of work that will have to be done to improve integration. Let's use this issue to coordinate work.

Here are some potential TODOs:

simulations.py, the file that generates synthetic DragoNN training data, was just removed from the simdna library (https://github.com/kundajelab/simdna/issues/4). We could give these simulation datasets a good home in DeepChem (somewhere in dc.moleculenet).
Performance matching: The SequenceDNN class right now is very crude, and doesn't have the bells and whistles of the associated DragoNN class. Some performance tuning work will have to be done to match reported DragoNN numbers.
~MotifRNN implementation~ gkmSVM implementation: This class hasn't yet been ported into DeepChem. (See discussion below)
PSSM scores (https://en.wikipedia.org/wiki/Position_weight_matrix) are standard for bioinformatic visualizations. Some integration into dc.metrics would be very useful.
FASTA file format support (https://en.wikipedia.org/wiki/FASTA_format)
(Optional) Add visualization of PSSM scores as generated by the DRAGONN tutorial.
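
To make the PSSM item above a bit more concrete, here is a minimal sketch of building a log-odds position weight matrix from aligned DNA sequences and scanning a longer sequence with it. None of this is existing DeepChem or DragoNN code: the names `build_pssm`, `score_sequence` and `best_window`, the pseudocount of 1, and the uniform 0.25 background are all assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pssm(aligned_seqs, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM as a list of {base: score} dicts, one per position.

    Assumes equal-length sequences over A/C/G/T and a uniform background
    (both are illustrative assumptions).
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases never give log(0).
        counts = {base: pseudocount for base in ALPHABET}
        for seq in aligned_seqs:
            counts[seq[pos].upper()] += 1
        total = sum(counts.values())
        pssm.append(
            {base: math.log2((counts[base] / total) / background) for base in ALPHABET}
        )
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-position log-odds scores for a sequence as long as the PSSM."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match PSSM length")
    return sum(column[base] for column, base in zip(pssm, seq.upper()))


def best_window(pssm, seq):
    """Slide the PSSM along seq and return (best_score, start_index)."""
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        score = score_sequence(pssm, seq[start:start + width])
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Toy motif built from three aligned sites, then located in a longer sequence.
pssm = build_pssm(["ACGT", "ACGT", "ACGA"])
best_score, best_start = best_window(pssm, "TTACGTTT")
```

A metric in dc.metrics would presumably want an array-based version of this, but the scoring logic would be the same.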

If you think you're interested in helping with the effort, please chime in on this thread.

CC @jisraeli: Would love your feedback on other ways we can improve integration.

LRParser commented 6 years ago

Any tips on how to install the GTC notebook pre-reqs? I get a missing package error in GTC_workshop_tutorial (launched jupyter from the root directory of the repo and navigated thru to contrib/dragonn in browser):

ModuleNotFoundError                       Traceback (most recent call last)
in ()
----> 1 from simulations import simulate_motif_density_localization
      2 print(simulate_motif_density_localization.__doc__)

~/deepchem-fork/contrib/dragonn/simulations.py in ()
      2 from collections import OrderedDict
      3 import numpy as np
----> 4 import simdna
      5 from simdna.synthetic import (
      6     RepeatedEmbedder, SubstringEmbedder, ReverseComplementWrapper,

ModuleNotFoundError: No module named 'simdna'

When I try to install dragonn via conda I get an error as my DeepChem envt is on 3.5:

(deepchem-fork) joe@powerspec:~/deepchem-fork/scripts$ conda install dragonn -c kundajelab
Fetching package metadata ...................
Solving package specifications: .

UnsatisfiableError: The following specifications were found to be in conflict:
  - dragonn -> deeplift ==0.3 -> python 2.7*
  - python 3.5*

Any suggestions?
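
For anyone hitting the same error, a small check like the following can list which of the notebook's imports are unavailable in the active environment before Jupyter is launched. It is only a sketch: the helper name `missing_modules` is made up, and the module list is an assumption based on the names that appear in the traceback and conda command above.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # find_spec can raise for dotted names whose parent package is
            # absent, or for modules in an odd state; treat those as missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Module names taken from the traceback above; adjust for your notebook.
prereqs = ["numpy", "simdna", "dragonn"]
print("Missing modules:", missing_modules(prereqs))
```

This only reports what is absent. It does not resolve the python 2.7 vs 3.5 conflict that conda reports for dragonn.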

jisraeli commented 6 years ago

I can post installation instructions tomorrow. How about a 'dragonn' conda environment file to get past prereq conflicts?

-J

rbharath commented 6 years ago

@jisraeli A custom conda file would be well appreciated!

Eventually, I'd like to move the Dragonn support into the main library, but that will take time since we'll have to introduce dependencies carefully.

rbharath commented 6 years ago

@LRParser I think I did some manual python 2.7 -> 3.5 conversions at some point. My recommendation is to just keep messing with it till things start to work. It wasn't too bad. We'll need to figure this stuff out before we can move support into deepchem proper though.

mlgill commented 6 years ago

I'm happy to help out as well. I'm new to DragoNN, but have started some work with DeepChem. I've built plenty of conda packages in my day, so I could help there if appropriate.

I've also been working with DeepChem using Dockerfiles (python 2.7 and 3.5) that I built which use NVIDIA optimized tensorflow but also add in cairocffi (better looking RDKit molecule images) and requirements for pyGPGO, which don't seem to be in the latest Docker container. GitHub repo here.

Happy to work on adding DragoNN and the other additional libraries to DeepChem's provided Docker container.

Probably other ways I can help out as well as I get more familiar with both libraries.

mlgill commented 6 years ago

Also, I think biopython will read FASTA files. Would there be interest in integrating that or do we feel there is a need to create our own parser?

rbharath commented 6 years ago

@mlgill Great to hear you're interested in helping!

I think adding biopython integration would be great. My sense is biopython:bioinformatics as rdkit:cheminformatics. We already depend on rdkit quite a bit, so makes sense to use biopython as a complement.

My recommendation would be to get comfortable with deepchem development by submitting a small starter PR. Once you get the hang of our style, it should be relatively straightforward to figure out a design for the biopython support.

jisraeli commented 6 years ago

Re fasta reading - there is a function in dragonn that takes in fasta filename and returns numpy array with one hot encoded sequences: https://github.com/kundajelab/dragonn/blob/master/dragonn/utils.py#L121
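
For readers unfamiliar with the idea, here is a minimal sketch of reading FASTA text and one-hot encoding the sequences into a numpy array. This is not the dragonn function linked above: the names `read_fasta` and `one_hot_encode`, the (n_sequences, 4, length) layout, and the all-zero encoding for ambiguous bases such as N are assumptions made for illustration, and dragonn's actual output shape may differ.

```python
import io

import numpy as np

BASES = "ACGT"


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle.

    Sequence lines belonging to one record are joined; any lines before
    the first '>' header are ignored.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def one_hot_encode(sequences):
    """Return a float32 array of shape (n_sequences, 4, length).

    Assumes all sequences share one length. Bases outside A/C/G/T
    (for example N) are left as all-zero columns.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(sequences), len(BASES), length), dtype=np.float32)
    for n, seq in enumerate(sequences):
        for pos, base in enumerate(seq.upper()):
            row = index.get(base)
            if row is not None:
                encoded[n, row, pos] = 1.0
    return encoded


# Toy input: the second record is split across two lines and ends in N.
fasta_text = ">seq1\nACGT\n>seq2\nAC\nGN\n"
records = list(read_fasta(io.StringIO(fasta_text)))
X = one_hot_encode([seq for _, seq in records])
```

If biopython is adopted as discussed above, its parser could replace `read_fasta` here while the encoding step stays the same.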

jisraeli commented 6 years ago

MotifRNN should probably be last on the priority list - we haven't used it in practice in years.

LRParser commented 6 years ago

Johnny - install tips would be much appreciated!

jisraeli commented 6 years ago

@LRParser python2 or python3?

rbharath commented 6 years ago

@jisraeli Would love tips on other models (if any) we should implement besides SequenceDNN.

jisraeli commented 6 years ago

@rbharath gkmSVM is useful for benchmarking the svm models that used to be SOTA: https://github.com/kundajelab/dragonn/blob/master/dragonn/models.py#L336. The implementation in DragoNN assumes this dependency: https://github.com/Dongwon-Lee/lsgkm.

rbharath commented 4 years ago

There aren't plans to integrate with DragoNN at present so closing. Will re-open if there's interest.

deepchem / deepchem

Improving DragoNN integration #1002
