beamandrew opened this issue 6 years ago
Great rules. What do you mean by the third bullet? You often want to calculate the statistical significance of your deep learning biomarker on the test dataset, so statistics is often part of a deep learning analysis. Sorry if I missed the point.
Can this be combined with #3?
I just wanted to point out that this suggestion by @beamandrew was cited as the source of:
"Deep learning really shines on unstructured, not structured data" in the current rules list, which I think is close to (but not exactly) the opposite of what Andy suggested.
I think what Andy was getting at is that deep learning has really thrived on specific problems where there is structure within the data and (I'd add) an architectural design crafted to exploit that type of structure. The canonical example is images, where you have tons of local structure that can be exploited via convolutional filters and then combined hierarchically from there.
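To make that concrete, here is a minimal sketch (mine, not part of the rules; layer sizes and the 10-class output are illustrative) of the kind of architecture that exploits this local structure:

```python
import torch.nn as nn

# Minimal sketch: small convolutional filters pick up local patterns,
# pooling and stacking combine them hierarchically. All sizes here are
# illustrative only.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local filters over pixels
    nn.ReLU(),
    nn.MaxPool2d(2),                               # merge nearby responses
    nn.Conv2d(16, 32, kernel_size=3, padding=1),   # patterns over patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),                       # collapse spatial dimensions
    nn.Flatten(),
    nn.Linear(32, 10),                             # e.g., a 10-class output
)
```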
In this light, I might consider something closer to:
"Consider the structure in your data and how it relates to existing neural architectures"
I think this type of comment might be best looped into rule "#3 Know your data and your question" and/or "#4 Choose an appropriate architecture"
"Deep learning really shines on unstructured, not structured data" in the current rules list, which I think is close to (but not exactly) the opposite of what Andy suggested. ... The canonical example is in images where you have tons of local structure that can be exploited via filters, convolutions, and then hierarchically combined up from there.
I think the "unstructured" data refers more to the fact that the data is in "raw" form (text, images) vs. tabular data (e.g., think of the Wisconsin breast cancer dataset, where certain landmarks were already extracted, or even Iris, where the flowers are already measured -- today, with DL, you would basically run a convnet on the actual flower images).
If I understand you correctly, maybe the idea that the rules.md is getting at is more along the lines of: "DL thrives when it can leverage underlying local structure. As such, unprincipled preprocessing that discards the natural structure can do more harm than good." I agree with that.
However, if that's the case:
Yeah, something along those lines. I.e., deep learning basically incorporates the representation learning, instead of the traditional approach of preprocessing the data into a chosen representation.
Off the top of my head, an example relevant for biology would be the "Convolutional Networks on Graphs for Learning Molecular Fingerprints" work (https://arxiv.org/pdf/1509.09292.pdf) from Ryan Adams's group. But also anything related to (DNA/amino acid) sequence and image analysis.
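For anyone who hasn't seen the neural-fingerprint idea, a rough sketch of one round of neighborhood aggregation over a molecular graph might look like the following (my own simplification, not the paper's exact architecture; the atom feature matrix and adjacency matrix are assumed to be given):

```python
import torch
import torch.nn as nn

class NeighborhoodAggregation(nn.Module):
    """One round of message passing over a molecular graph, loosely in the
    spirit of neural fingerprints (Duvenaud et al., 2015)."""
    def __init__(self, in_dim, hidden_dim, fp_dim):
        super().__init__()
        self.update = nn.Linear(in_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, fp_dim)

    def forward(self, atom_feats, adjacency):
        # atom_feats: (n_atoms, in_dim); adjacency: (n_atoms, n_atoms)
        neighbor_sum = adjacency @ atom_feats            # sum neighbor features
        h = torch.relu(self.update(atom_feats + neighbor_sum))
        # soft "hash" of each atom's environment, pooled into one
        # molecule-level fingerprint vector
        return torch.softmax(self.readout(h), dim=-1).sum(dim=0)
```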
I don't think that "unstructured" is the right word: "Unstructured" is very different from "not preprocessed".
Yeah, I agree; that's unfortunately one of those jargon terms in the field. The term really only makes sense in contrast to "tabular data".
EDIT:
This is not to say that DL doesn't benefit from preprocessing. E.g., pretty much everyone does image preprocessing in DL (and a PNG/JPG image is a preprocessed format in its own right); for facial attribute classification, we usually align faces (based on eye and nose locations), center-crop them, etc. So maybe a better term would be "feature extraction" (but that is also a subcategory of preprocessing :P )
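As a concrete illustration of that kind of preprocessing, a typical torchvision pipeline before a convnet might look like this (the crop size and normalization statistics are just common ImageNet-style defaults, not a recommendation; face alignment itself would need a separate landmark-detection step not shown here):

```python
from torchvision import transforms

# Sketch of common image preprocessing before a convnet; values are
# illustrative defaults, not part of the proposed rules.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),        # e.g., crop around an aligned face
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```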
One last comment along this same spirit:
Currently we have rule number four as "Know your data and your question" -- I might bump this up to the first rule. It seems to me that the points described in each of the issues that were combined into this rule apply just as much to using classic ML or building strong baseline models.
It also seems like a good natural rule #1, since that's really where the whole journey starts.
Not all problems are equally amenable to deep learning. Make sure there is a high prior probability that your problem will benefit from deep learning. Some useful rules of thumb:

- Is there "structure" in your data? Data types such as imaging, text, and time series contain useful and regular kinds of structure and correlation that a deep learning model can exploit more easily than traditional models. If all you have is tabular data, then DL is unlikely to provide much lift (see the baseline sketch after this list).
- Do you have a lot of data? Deep learning is much more scalable (due to the ability to leverage GPU computing) and can take advantage of large datasets more easily.
- Do you want to assess statistical significance? If so, deep learning might not be the best fit.
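To ground the tabular-data point, one quick (hypothetical) sanity check is to see how far a classic baseline gets before reaching for deep learning, e.g. with scikit-learn on the Wisconsin breast cancer data mentioned earlier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative baseline on a small tabular dataset; a deep model would
# have to clearly beat this number to justify the extra complexity.
X, y = load_breast_cancer(return_X_y=True)
baseline = GradientBoostingClassifier(random_state=0)
print(cross_val_score(baseline, X, y, cv=5).mean())
```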