agitter opened 5 years ago
I'm really interested to see if there is literature about this, one way or the other. I don't think there will end up being one rule other than "it depends on your data". If you're dealing with images (I'm sure @beamandrew can comment much more intelligently than I), skipping feature engineering is possible. On the other hand, making features biologically meaningful based on domain knowledge is one place that biologists can really add value.
This is highly domain and data-modality specific. It's definitely not true that feature engineering is still required in all domains; it's more true for problems with tabular input to a neural network, where you actually have to define the features.
There is evidence that classical DNNs don't work very well with tabular data without new regularizers. https://arxiv.org/abs/1805.06440
It's probably better to rephrase this point as "data cleaning and preprocessing are extremely important" as compared to "feature engineering is important".
An additional reference regarding comparisons of neural networks with traditional methods in bio is
They found that with extensive optimization, neural nets outperform "shallow" methods, but only by a small margin. The input data was a traditional fingerprint representation of molecules (which is essentially preprocessed, tabular data).
Right. I think the idea is that neural networks are not necessarily the right tool when you already have informative features, like tabular data. Rather, neural networks are good at extracting informative features from raw, structured data like images or sentences or nucleotide sequences.
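To make that distinction concrete, here is a minimal PyTorch sketch of the "features from raw input" case: a toy 1D CNN over one-hot encoded nucleotide sequences, where the convolutional filters stand in for the motif or k-mer features you would otherwise engineer by hand. The sequence length, layer sizes, and the binary task are arbitrary assumptions for illustration, not taken from any of the papers mentioned here.

```python
import torch
import torch.nn as nn

class SeqCNN(nn.Module):
    """Toy 1D CNN over one-hot DNA: the filters learn motif-like features
    directly from the raw sequence, with no hand-crafted k-mer counts."""
    def __init__(self, n_filters=32, kernel_size=12):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size)  # 4 input channels = A/C/G/T
        self.pool = nn.AdaptiveMaxPool1d(1)               # max over sequence positions
        self.fc = nn.Linear(n_filters, 1)                 # e.g. a bound/unbound logit

    def forward(self, x):                 # x: (batch, 4, seq_len), one-hot
        h = torch.relu(self.conv(x))      # (batch, n_filters, seq_len - kernel_size + 1)
        h = self.pool(h).squeeze(-1)      # (batch, n_filters) learned features
        return self.fc(h)                 # (batch, 1) logits

# Shape check with a random one-hot batch of 8 sequences of length 200.
x = torch.zeros(8, 4, 200)
x[torch.arange(8)[:, None], torch.randint(0, 4, (8, 200)), torch.arange(200)] = 1.0
print(SeqCNN()(x).shape)  # torch.Size([8, 1])
```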
NNs are definitely not as developed for these areas, but -- and I assume we're aligned on this -- I would not be comfortable saying in a paper that NNs are not the right tool for these problems. There is some work that has shown, for example, that learning embeddings for structured data could provide a lot of lift on some problems. It's possible that these approaches will win out but just haven't been fully established yet.
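As a rough illustration of the embedding idea (a minimal sketch with made-up categorical columns and cardinalities, not the approach of any particular paper), each categorical column gets a learned embedding that is concatenated with the numeric columns and fed into an MLP:

```python
import torch
import torch.nn as nn

class TabularEmbedNet(nn.Module):
    """Toy model that learns embeddings for categorical columns instead of
    relying on one-hot or hand-coded encodings; cardinalities are made up."""
    def __init__(self, cardinalities=(20, 5), emb_dim=8, n_numeric=10):
        super().__init__()
        self.embs = nn.ModuleList([nn.Embedding(c, emb_dim) for c in cardinalities])
        in_dim = emb_dim * len(cardinalities) + n_numeric
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x_cat, x_num):      # x_cat: (batch, n_cat) ints, x_num: (batch, n_numeric)
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
        return self.mlp(torch.cat(embedded + [x_num], dim=1))

# Shape check with random inputs for 16 examples.
model = TabularEmbedNet()
x_cat = torch.stack([torch.randint(0, 20, (16,)), torch.randint(0, 5, (16,))], dim=1)
print(model(x_cat, torch.randn(16, 10)).shape)  # torch.Size([16, 1])
```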
Deep learning can also be good for combining structured and unstructured data (e.g. sequences and tabular covariates for each example), which IMO can be difficult to do with other paradigms. This is especially nice for biological applications, since it allows a sort of compromise between powerful ML approaches leveraging vast amounts of unstructured data and more interpretable mechanistic models.
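A minimal sketch of that two-branch pattern, assuming a hypothetical task with a raw sequence plus a handful of tabular covariates per example (all layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class SeqPlusCovariates(nn.Module):
    """Toy two-branch model: a CNN branch learns features from a raw sequence,
    a dense branch takes tabular covariates, and the two are concatenated
    before a shared output head. All sizes are arbitrary."""
    def __init__(self, n_cov=6):
        super().__init__()
        self.seq_branch = nn.Sequential(
            nn.Conv1d(4, 16, 8), nn.ReLU(), nn.AdaptiveMaxPool1d(1), nn.Flatten())
        self.cov_branch = nn.Sequential(nn.Linear(n_cov, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, 1)

    def forward(self, seq, cov):          # seq: (batch, 4, seq_len), cov: (batch, n_cov)
        features = torch.cat([self.seq_branch(seq), self.cov_branch(cov)], dim=1)
        return self.head(features)

# Shape check with random inputs (the sequence tensor is not one-hot here, just a shape demo).
print(SeqPlusCovariates()(torch.randn(4, 4, 200), torch.randn(4, 6)).shape)  # torch.Size([4, 1])
```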
Yes, I think this is what AlphaFold was doing, and it turned out to be successful. Basically, with DL we don't want to throw "all" (domain) knowledge out the window but can leverage the best of both worlds :)
This probably fits best in this tip (or the Intro) if we want to mention it specifically.
Have you checked the list of proposed rules to see if the rule has already been proposed?
Did you add yourself as a contributor by making a pull request if this is your first contribution?
Feel free to elaborate, rant, and/or ramble.
Any citations for the rule? (peer-reviewed literature preferred but not required)
I have seen assertions that deep learning reduces or eliminates the need for feature engineering because the network itself constructs features from "raw" inputs. Most examples supporting this assertion come from imaging tasks.
I propose having a rule discussing whether or not this is true in biology, or perhaps the types of biological tasks for which feature engineering is still required. Unlike some of our other rules that are applicable to all DL or all ML in biology, this one would be specific to DL in biology.
I have some initial thoughts about feature engineering in DL but am hoping to gather a broad set of references before proposing the final rule.