20n / act

Computational synthetic biology: Predicting DNA edits for bioengineering
http://20n.com
GNU General Public License v3.0

file ending of '01' ?? #2

Closed tentrillion closed 7 years ago

tentrillion commented 7 years ago

Thanks for publishing this code! It looks very interesting and I wanted to try it.

I installed TensorFlow, Python, and all the other dependencies with conda on Mac OS X, cloned the repo, and was able to run python bucketed_differential_deep.py -h successfully, so I assume TensorFlow, Keras, etc. are functioning.

I moved the faahKO data, which is in netCDF format, to an lcms_data directory I made in the act/reachables/src/main/python/DeepLearningLcmsPeak directory, and then tried this:

python bucketed_differential_deep.py --control lcms_data/KO/ko15.CDF --experimental lcms_data/WT/wt15.CDF  --outputDirectory faahko_out/ --lcmsDirectory lcms_data

The result is an error about the proper ending of file names, which I don't completely understand:

(20n) curt@DN2lk5k46:~/20n/act/reachables/src/main/python/DeepLearningLcmsPeak$ python bucketed_differential_deep.py --control lcms_data/KO/ko15.CDF --experimental lcms_data/WT/wt15.CDF  --outputDirectory faahko_out/ --lcmsDirectory lcms_data
Using TensorFlow backend.
/Users/curt/20n/act/reachables/src/main/python/DeepLearningLcmsPeak/bucketed_peaks/modules/lcms_autoencoder.py:174: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(units=70, activation="linear")`
  encoded = Dense(output_dim=first_layer_dim, activation="linear")(input_layer)
/Users/curt/20n/act/reachables/src/main/python/DeepLearningLcmsPeak/bucketed_peaks/modules/lcms_autoencoder.py:175: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(units=30, activation="linear")`
  encoded = Dense(output_dim=second_layer_dim, activation="linear")(encoded)
/Users/curt/20n/act/reachables/src/main/python/DeepLearningLcmsPeak/bucketed_peaks/modules/lcms_autoencoder.py:178: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(units=10, activation="linear")`
  encoded = Dense(output_dim=self.encoding_size, activation="linear")(encoded)
/Users/curt/20n/act/reachables/src/main/python/DeepLearningLcmsPeak/bucketed_peaks/modules/lcms_autoencoder.py:188: UserWarning: Update your `Model` call to the Keras 2 API: `Model(outputs=Tensor("de..., inputs=Tensor("in...)`
  encoder = Model(input=input_layer, output=encoded)
Traceback (most recent call last):
  File "bucketed_differential_deep.py", line 150, in <module>
    row_matrix1 = merge_lcms_replicates(experimental_samples)
  File "bucketed_differential_deep.py", line 110, in merge_lcms_replicates
    scans = [autoencoder.process_lcms_scan(lcms_directory, scan) for scan in samples]
  File "/Users/curt/20n/act/reachables/src/main/python/DeepLearningLcmsPeak/bucketed_peaks/modules/lcms_autoencoder.py", line 123, in process_lcms_scan
    "was {}".format(scan_file_name)
AssertionError: This module only processes MS1 data which should always have a file ending of '01'.  Your supplied file was lcms_data/WT/wt15.CDF
(20n) curt@DN2lk5k46:~/20n/act/reachables/src/main/python/DeepLearningLcmsPeak$

What is the right format for the filenames I want to supply to this code?

saurabh20n commented 7 years ago

Thanks for trying out the analysis.

That code was developed specifically over raw LCMS traces exported from Waters' instruments. The output in those cases was three files, 01.nc, 02.nc, and 03.nc, of which the first held the real sample data; hence the 01 file name check.
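
As a rough illustration of that naming convention (a sketch only, not the code in the repository): the helper name select_ms1_files and the example file names below are hypothetical, and the rule that the base name must end in '01' before the extension is an assumption drawn from the assertion message.

```python
import os


def select_ms1_files(file_names):
    """Return only the files whose base name (extension removed) ends in '01'.

    Hypothetical helper: assumes the MS1 trace is the export whose name
    ends in '01', as in the 01.nc / 02.nc / 03.nc convention.
    """
    ms1_files = []
    for name in file_names:
        # Drop the directory and the extension before checking the ending.
        stem, _extension = os.path.splitext(os.path.basename(name))
        if stem.endswith("01"):
            ms1_files.append(name)
    return ms1_files


# Hypothetical export of a single sample following the 01/02/03 naming.
exported = ["run7/sample01.nc", "run7/sample02.nc", "run7/sample03.nc"]
ms1_only = select_ms1_files(exported)
```

Under this assumed rule, a file such as lcms_data/WT/wt15.CDF would not be selected, which is consistent with the error reported above.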

Cannot make any claims about how the code is going to perform over your sample data. But please try.

I have pushed https://github.com/20n/act/commit/368170097e57b0d71e123c10903f2494e869587f, which changes that assertion to a warning. The change has been merged into master, so pull.
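
The general pattern of turning a hard assertion into a warning looks roughly like the sketch below. It is not the content of that commit: the function name check_ms1_file_name, the exact ending rule, and the message wording are assumptions made for illustration.

```python
import os
import warnings


def check_ms1_file_name(scan_file_name):
    """Warn, rather than fail, when the file name lacks the '01' ending.

    Hypothetical function: returns True if the name (extension removed)
    ends in '01', otherwise emits a UserWarning and returns False.
    """
    stem, _extension = os.path.splitext(scan_file_name)
    looks_like_ms1 = stem.endswith("01")
    if not looks_like_ms1:
        # An assert here would stop processing; a warning lets it continue.
        warnings.warn(
            "This module expects MS1 data with a file ending of '01'. "
            "Your supplied file was {}".format(scan_file_name)
        )
    return looks_like_ms1
```

With a check like this, a file such as lcms_data/WT/wt15.CDF would produce a warning on stderr and processing would carry on, leaving it to the user to judge whether the input is really MS1 data.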

Best of luck.

tentrillion commented 7 years ago

Thanks for the quick response. It will be some time before I can try the improved version, but I will report back when I do. In the meantime, if any more documentation or descriptions of this technique become available, let me know and I will check them out. The data set I want to try is the "standard" data that's been provided with xcms for nearly a decade, so I think it could be a useful way to compare your cool new ML approach with a tried-and-true, but aging, set of algorithms.

saurabh20n commented 7 years ago

Indeed. Side-by-side comparisons are good.

We collected 2400+ LCMS traces over our engineered organisms, and exclusively used this untargeted metabolomics pipeline to analyze the supernatants and pellets. We consistently got the right calls on both our pathway products and discovered side products. To evaluate against XCMS, we did run a short project comparing the outputs, and found this pipeline more robust.

Admittedly, we were not XCMS experts, so we could not hand-tune the parameters to perfection, and it appears you might have the experience to do it properly. If you'd like, I can give you sample LCMS files from our runs (wild-type vs. engineered), including information on the pathway products this code detects. You can try to make XCMS detect the same.

For the moment, I am going to close this issue. If you want to experiment with our data, please open an appropriate issue later and I'll be happy to assist.