agitter opened 5 years ago
I'm really interested to see if there is literature about this, one way or the other. I don't think there will end up being one rule other than "it depends on your data". If you're dealing with images (I'm sure @beamandrew can comment much more intelligently than I), skipping feature engineering is possible. On the other hand, making features biologically meaningful based on domain knowledge is one place that biologists can really add value.
This is highly domain and data-modality specific. It's definitely not true that feature engineering is still required in all domains; it's more true for problems with tabular input to a neural network, where you actually have to define the features.
There is evidence that classical DNNs don't work very well with tabular data without new regularizers. https://arxiv.org/abs/1805.06440
It's probably better to rephrase this point as "data cleaning and preprocessing are extremely important" as compared to "feature engineering is important".
An additional reference regarding comparisons of neural networks with traditional methods in bio is
They found that with extensive optimization, neural nets outperform "shallow" methods, but only by a small margin. The input data was a traditional fingerprint representation of molecules (which is essentially preprocessed, tabular data).
Right. I think the idea is that neural networks are not necessarily the right tool when you already have informative features, like tabular data. Rather, neural networks are good at extracting informative features from raw, structured data like images or sentences or nucleotide sequences.
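To make that distinction concrete, here is a minimal PyTorch sketch of the "features from raw input" case: a toy 1D CNN over one-hot encoded nucleotide sequences, where the convolutional filters stand in for the motif or k-mer features you would otherwise engineer by hand. The sequence length, layer sizes, and the binary task are arbitrary assumptions for illustration, not taken from any of the papers mentioned here.

```python
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    """Toy 1D CNN over one-hot DNA: the filters learn motif-like features
    directly from the raw sequence, with no hand-crafted k-mer counts."""
    def __init__(self, n_filters=32, kernel_size=12):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size)  # 4 input channels = A/C/G/T
        self.pool = nn.AdaptiveMaxPool1d(1)               # max over sequence positions
        self.fc = nn.Linear(n_filters, 1)                 # e.g. a bound/unbound logit

    def forward(self, x):                 # x: (batch, 4, seq_len), one-hot
        h = torch.relu(self.conv(x))      # (batch, n_filters, seq_len - kernel_size + 1)
        h = self.pool(h).squeeze(-1)      # (batch, n_filters) learned features
        return self.fc(h)                 # (batch, 1) logits

# Shape check with a random one-hot batch of 8 sequences of length 200.
x = torch.zeros(8, 4, 200)
x[torch.arange(8)[:, None], torch.randint(0, 4, (8, 200)), torch.arange(200)] = 1.0
print(SeqCNN()(x).shape)  # torch.Size([8, 1])
```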
NNs are definitely not as developed for these areas, but -- and I assume we're aligned on this -- I would not be comfortable saying in a paper that NNs are not the right tool for these problems. There is some work that has shown, for example, that learning embeddings for structured data could provide a lot of lift on some problems. It's possible that these approaches will win out but just haven't been fully established yet.
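As a rough illustration of the embedding idea (a minimal sketch with made-up categorical columns and cardinalities, not the approach of any particular paper), each categorical column gets a learned embedding that is concatenated with the numeric columns and fed into an MLP:

```python
import torch
import torch.nn as nn

class TabularEmbedNet(nn.Module):
    """Toy model that learns embeddings for categorical columns instead of
    relying on one-hot or hand-coded encodings; cardinalities are made up."""
    def __init__(self, cardinalities=(20, 5), emb_dim=8, n_numeric=10):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(c, emb_dim) for c in cardinalities])
        in_dim = emb_dim * len(cardinalities) + n_numeric
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_cat, x_num):      # x_cat: (batch, n_cat) ints, x_num: (batch, n_numeric)
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
        return self.mlp(torch.cat(embedded + [x_num], dim=1))

# Shape check with random inputs for 16 examples.
model = TabularEmbedNet()
x_cat = torch.stack([torch.randint(0, 20, (16,)), torch.randint(0, 5, (16,))], dim=1)
print(model(x_cat, torch.randn(16, 10)).shape)  # torch.Size([16, 1])
```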
Deep learning can also be good for combining structured and unstructured data (e.g. sequences and tabular covariates for each example), which IMO can be difficult to do with other paradigms. This is especially nice for biological applications, since it allows a sort of compromise between powerful ML approaches leveraging vast amounts of unstructured data and more interpretable mechanistic models.
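A minimal sketch of that two-branch pattern, assuming a hypothetical task with a raw sequence plus a handful of tabular covariates per example (all layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class SeqPlusCovariates(nn.Module):
    """Toy two-branch model: a CNN branch learns features from a raw sequence,
    a dense branch takes tabular covariates, and the two are concatenated
    before a shared output head. All sizes are arbitrary."""
    def __init__(self, n_cov=6):
        super().__init__()
        self.seq_branch = nn.Sequential(
            nn.Conv1d(4, 16, 8), nn.ReLU(), nn.AdaptiveMaxPool1d(1), nn.Flatten())
        self.cov_branch = nn.Sequential(nn.Linear(n_cov, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, 1)

    def forward(self, seq, cov):          # seq: (batch, 4, seq_len), cov: (batch, n_cov)
        features = torch.cat([self.seq_branch(seq), self.cov_branch(cov)], dim=1)
        return self.head(features)

# Shape check with random inputs (the sequence tensor is not one-hot here, just a shape demo).
print(SeqPlusCovariates()(torch.randn(4, 4, 200), torch.randn(4, 6)).shape)  # torch.Size([4, 1])
```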
Yes, I think this is what AlphaFold was doing, and it turned out to be successful. Basically, with DL we don't want to throw "all" (domain) knowledge out the window but can leverage the best of both worlds :)
This probably fits best in this tip (or the Intro) if we want to mention it specifically.
Have you checked the list of proposed rules to see if the rule has already been proposed?
Did you add yourself as a contributor by making a pull request if this is your first contribution?
Feel free to elaborate, rant, and/or ramble.
Any citations for the rule? (peer-reviewed literature preferred but not required)
I have seen assertions that deep learning reduces or eliminates the need for feature engineering because the network itself constructs features from "raw" inputs. Most examples supporting this assertion come from imaging tasks.
I propose having a rule discussing whether or not this is true in biology, or perhaps the types of biological tasks for which feature engineering is still required. Unlike some of our other rules that are applicable to all DL or all ML in biology, this one would be specific to DL in biology.
I have some initial thoughts about feature engineering in DL but am hoping to gather a broad set of references before proposing the final rule.