Investigate object/boundary/edge detection ML algorithms

asmirn1 commented 3 years ago

Look into OpenCV for computer vision algorithms

du-lab commented 3 years ago

Progress:

Jerry has looked into OpenCV. The Haar Cascade algorithm appears to do what we need for improving our peak detection, so we will start with it.
Find training datasets to test Haar Cascade.

jerrychen04 commented 3 years ago

List of Algorithms:

jerrychen04 commented 3 years ago

https://blog.roboflow.com/how-to-train-yolov5-on-a-custom-dataset/

jerrychen04 commented 3 years ago

Evaluating performance of models using mAP and AP:

https://www.kdnuggets.com/2021/03/evaluating-object-detection-models-using-mean-average-precision.html
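For reference, here is a minimal sketch of how AP can be computed for a single class from boxes and confidences. It is an illustration only, not the evaluation code YOLOv5 uses: the function names `iou` and `average_precision`, the `(x_min, y_min, x_max, y_max)` box layout, and the simplified matching rule (each prediction is matched to the best still-unmatched ground-truth box) are assumptions made for this example. mAP would then be the mean of AP over classes (and possibly over IoU thresholds).

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, truths, iou_threshold=0.5):
    """AP for one class.

    predictions: list of (confidence, box); truths: list of boxes.
    Returns the area under the interpolated precision-recall curve.
    """
    if not truths:
        return 0.0
    matched = set()
    tp_flags = []
    # Visit predictions from most to least confident.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            overlap = iou(box, truth)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx is not None and best_iou >= iou_threshold
        if is_tp:
            matched.add(best_idx)
        tp_flags.append(is_tp)

    # Cumulative precision and recall after each prediction.
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / len(truths))

    # Interpolate: precision at each point is the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```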

asmirn1 commented 3 years ago

TODO:

Run YOLOv5 locally using our GPU
Try YOLOv5 on the Trace training data

To apply YOLO to mzML data, we first need to convert that data into overlapping images and the corresponding labels.
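As a rough illustration of the label side of that conversion, the sketch below turns one peak's m/z and retention time ranges into a YOLO-style label line (`class x_center y_center width height`, all normalized to 0..1 relative to the image). The function name `peak_to_yolo_label` is hypothetical, and the axis layout (m/z along x, retention time along y, with the block minimum at coordinate 0) is an assumption that has to match however the images are actually drawn.

```python
def peak_to_yolo_label(peak_mz, peak_rt, block_mz, block_rt, class_id=0):
    """Build a YOLO label line for one peak inside one image block.

    peak_mz, peak_rt, block_mz, block_rt are (min, max) tuples.
    Assumes m/z maps to the x axis and retention time to the y axis.
    """
    mz_span = block_mz[1] - block_mz[0]
    rt_span = block_rt[1] - block_rt[0]
    # Box center and size as fractions of the block extent.
    x_center = ((peak_mz[0] + peak_mz[1]) / 2 - block_mz[0]) / mz_span
    y_center = ((peak_rt[0] + peak_rt[1]) / 2 - block_rt[0]) / rt_span
    width = (peak_mz[1] - peak_mz[0]) / mz_span
    height = (peak_rt[1] - peak_rt[0]) / rt_span
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
```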

[Image: Whiteboard 4-01]

asmirn1 commented 3 years ago

TODO:

Plot images with m/z values and retention times as labels on x and y axes
On the color bar, display real intensities (before scaling).

jerrychen04 commented 3 years ago

Profiled: https://drive.google.com/file/d/1bH_LAk3_2amxwpljrvoAU6GxyQZgAc3k/view?usp=sharing

jerrychen04 commented 3 years ago

Centroided: 2016-03-15_EP03_D11_cell-E2-2 on the study link

du-lab commented 3 years ago

Status:

The data that we have downloaded from the workbench using the link in the Trace paper seems to be centroided and we did not find profile data.
Have emailed Dr. Nemes for the profile data that was used for developing Trace.

TODO:

Segment the DCSM profile data into overlapping blocks to be used for training the YOLO model.
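One possible way to lay out the overlapping blocks is sketched below. The function names, the window sizes, and the choice to compute windows independently along each axis are assumptions for illustration. The loop stops only once a window reaches the upper bound, so the maximum value along each axis always falls inside the last window.

```python
from itertools import product


def overlapping_windows(start, stop, size, overlap):
    """Return (lo, hi) windows of length `size` that cover [start, stop].

    Consecutive windows share `overlap` units. The last window may extend
    past `stop`, which guarantees that `stop` itself is covered.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    i = 0
    while True:
        lo = start + i * step  # multiply instead of accumulating to limit drift
        windows.append((lo, lo + size))
        if lo + size >= stop:
            break
        i += 1
    return windows


def overlapping_blocks(mz_range, rt_range, mz_size, rt_size, mz_overlap, rt_overlap):
    """Return every (mz_window, rt_window) pair covering the 2D range."""
    mz_windows = overlapping_windows(mz_range[0], mz_range[1], mz_size, mz_overlap)
    rt_windows = overlapping_windows(rt_range[0], rt_range[1], rt_size, rt_overlap)
    return list(product(mz_windows, rt_windows))
```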

asmirn1 commented 3 years ago

We tried YOLO on ~50 images and it seems to be working.

Next steps:

Create 50 new images by combining the ~24 peaks from Owen's paper and the ~26 most intense peaks detected by ADAP-BIG. Train YOLO on these new images.
Try different intensity scaling: no-scaling, square-root, and log-10.
Try images of bigger size.
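The three scaling options could be compared with something like the sketch below. The names `scale_intensity` and `to_pixels` are hypothetical, and two details are assumptions rather than anything decided in this thread: the log variant uses `log10(value + 1)` so that zero intensities stay finite, and pixel values are normalized to 0..255 per block.

```python
import math


def scale_intensity(value, method="none"):
    """Scale one non-negative intensity with the chosen method."""
    if value < 0:
        raise ValueError("intensity must be non-negative")
    if method == "none":
        return float(value)
    if method == "sqrt":
        return math.sqrt(value)
    if method == "log10":
        # +1 keeps zero intensities at 0 instead of -infinity (an assumption).
        return math.log10(value + 1.0)
    raise ValueError(f"unknown scaling method: {method}")


def to_pixels(block, method="none"):
    """Scale a 2D list of intensities and normalize it to integers in 0..255."""
    scaled = [[scale_intensity(v, method) for v in row] for row in block]
    peak = max(max(row) for row in scaled)
    if peak == 0:
        return [[0 for _ in row] for row in scaled]
    return [[round(255 * v / peak) for v in row] for row in scaled]
```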

asmirn1 commented 3 years ago

TODO:

Convert the local coordinates, width, height into actual m/z and retention time values
Finish the workflow and output a CSV file with m/z value, retention time, intensity, m/z range, retention time range, and confidence.
Remove all the unnecessary parts (axes, title, color bar, etc.) from the images.
Make the image size smaller.
Remake the images and retrain the model.
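The first two items could look roughly like the sketch below. It assumes YOLO returns each box as normalized `(x_center, y_center, width, height)`, that m/z runs along x and retention time along y, and that the block's m/z and retention time ranges are known. The function names and the CSV column names are made up for this example, and taking the box center as the peak's m/z and retention time is a simplification (the real apex would come from the raw data).

```python
import csv

FIELDS = ["mz", "rt", "intensity", "mz_min", "mz_max", "rt_min", "rt_max", "confidence"]


def detection_to_row(box, confidence, intensity, block_mz, block_rt):
    """Convert one normalized YOLO box into actual m/z and retention time values.

    box is (x_center, y_center, width, height) in 0..1 image coordinates;
    block_mz and block_rt are the (min, max) ranges covered by the image.
    """
    x_center, y_center, width, height = box
    mz_span = block_mz[1] - block_mz[0]
    rt_span = block_rt[1] - block_rt[0]
    return {
        "mz": block_mz[0] + x_center * mz_span,  # box center, not the true apex
        "rt": block_rt[0] + y_center * rt_span,
        "intensity": intensity,
        "mz_min": block_mz[0] + (x_center - width / 2) * mz_span,
        "mz_max": block_mz[0] + (x_center + width / 2) * mz_span,
        "rt_min": block_rt[0] + (y_center - height / 2) * rt_span,
        "rt_max": block_rt[0] + (y_center + height / 2) * rt_span,
        "confidence": confidence,
    }


def write_peak_table(rows, handle):
    """Write the peak rows to an open text file handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.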

asmirn1 commented 3 years ago

TODO:

Fix m/z and retention time ranges. Currently, some of them don't look right.
Add comments for your functions, explaining what those functions do.
Fix the block construction algorithm. Currently, the maximum m/z may not be included in the blocks.
Prepare the new training data (618 peaks). Add retention time for each peak (up to one decimal point).

asmirn1 commented 3 years ago

Current issues:

Need to find all unique m/z values before constructing each block
Check peak with m/z=237.1449 and ret time = 10.51
Create two images that split a single peak 50/50, and check what YOLO will detect.
Finish the new training dataset with 618 peaks.
Try different confidence levels and see how many peaks are detected in each case and what's the overlap with ADAP-BIG results in each case.

Next time, take another look at calculated m/z and retention time ranges.

jerrychen04 commented 3 years ago

https://www.quora.com/How-can-YOLO-compute-the-confidence-score-at-test-time-They-say-they-compute-it-as-P-object-IOU-But-during-test-time-you-dont-have-the-ground-truth-boxes-How-is-it-possible

du-lab commented 3 years ago

https://colab.research.google.com/drive/1Zfc0K-rSsAA366ymqeoTUNEbx3bATYP8?usp=sharing

This is the YOLO notebook.

asmirn1 commented 3 years ago

We have fixed all the bugs in the image generation and trained YOLO again. The precision is 90.9%, the recall is 95.1%, and the mean average precision is 94.8%. We've got about 450 peaks after processing 1/5 of the DCSM data file.

TODO:

Reduce the image size so that the algorithm runs faster, then retrain YOLO.
Finish the workflow file adap_3d_main.py. The input for this script should be a raw data file, a work directory, and a confidence threshold. Users should run it as follows (use the argparse package for this):
```
python adap_3d_main.py --file FILENAME --output CSV_FILENAME --work-directory FOLDER --confidence-threshold 25
```
The script should create intermediate data files in the work directory and output the peak table into the CSV file.
Process the entire DCMS data file and compare the results to ADAP-BIG. Plot a Venn diagram with the number of common peaks and unique peaks for each method. When comparing peaks, use m/z tolerance 0.005 and retention time tolerance 0.1.
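A minimal argparse sketch for the command line shown above. The four flags come from that example; everything else (the help texts, making every flag required, and treating the threshold as a percentage between 0 and 100) is an assumption, and the processing step itself is left as a placeholder.

```python
import argparse


def build_parser():
    """Create the command-line parser for adap_3d_main.py."""
    parser = argparse.ArgumentParser(
        description="Detect peaks in a raw data file with YOLO.")
    parser.add_argument("--file", required=True,
                        help="path to the raw data file")
    parser.add_argument("--output", required=True,
                        help="path of the CSV peak table to write")
    parser.add_argument("--work-directory", required=True,
                        help="folder for intermediate data files")
    parser.add_argument("--confidence-threshold", required=True, type=float,
                        help="minimum detection confidence (assumed to be in percent)")
    return parser


def main(argv=None):
    """Parse the arguments; the actual workflow would be called from here."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0 <= args.confidence_threshold <= 100:
        parser.error("--confidence-threshold must be between 0 and 100")
    # Placeholder: image generation, prediction, and CSV export would go here.
    return args
```

In the script itself, `main()` would be called under the usual `if __name__ == "__main__":` guard. Note that argparse exposes `--work-directory` as `args.work_directory` and `--confidence-threshold` as `args.confidence_threshold`.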

asmirn1 commented 3 years ago

Current results: we processed the "entire" DCMS raw data file.

We've got 3874 peaks with confidence >80%. Of those, 3680 are also detected by ADAP-BIG. (ADAP-BIG detected a total of 16256 peaks)
We've got 10911 peaks with confidence >70%. Of those, 9713 are also detected by ADAP-BIG.
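For context, overlap counts like these depend on the matching rule. The sketch below shows one simple possibility, using the m/z tolerance of 0.005 and retention time tolerance of 0.1 mentioned earlier in this thread. It is not necessarily how the numbers above were produced: the function name and the greedy one-to-one matching are assumptions, and the nested loop is quadratic, so sorting by m/z first would be worth it for tens of thousands of peaks.

```python
def count_common_peaks(peaks_a, peaks_b, mz_tol=0.005, rt_tol=0.1):
    """Count peaks shared by two lists of (mz, rt) tuples.

    Each peak in peaks_b can be matched at most once (greedy matching).
    Returns (common, unique_to_a, unique_to_b), i.e. the three Venn regions.
    """
    used = set()
    common = 0
    for mz_a, rt_a in peaks_a:
        for idx, (mz_b, rt_b) in enumerate(peaks_b):
            if idx in used:
                continue
            if abs(mz_a - mz_b) <= mz_tol and abs(rt_a - rt_b) <= rt_tol:
                used.add(idx)
                common += 1
                break
    return common, len(peaks_a) - common, len(peaks_b) - common
```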

One issue with the new algorithm is that it currently takes a very long time: about 6 hours to create images from a raw data file, and about 1.5 hours to perform the prediction.

To speed it up, we need to avoid creating thousands of images and saving them on disk. To do that, there are a few options:

Use parallel processing when generating images and writing them on disk.
Instead of saving each individual image on disk, we need to create a Python object containing all the images (or numpy arrays) and save that object on disk instead. Then, we'll need to modify detect.py and possibly other classes to be able to read that object. The goal here is to create numpy arrays and feed them to YOLO without saving images on disk.
Write the YOLO neural network that works with our numpy arrays from scratch.
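As a sketch of the second option, all blocks can be stacked into one numpy array and written to a single compressed `.npz` file together with their m/z and retention time ranges. The function names and array keys below are hypothetical, the blocks are assumed to share one shape, and how detect.py would consume the loaded array is not shown here.

```python
import numpy as np


def pack_blocks(blocks):
    """Stack equally sized 2D intensity blocks into one (N, H, W) float32 array."""
    return np.stack([np.asarray(block, dtype=np.float32) for block in blocks])


def save_blocks(file, images, mz_ranges, rt_ranges):
    """Save all blocks and their ranges into a single compressed .npz file.

    `file` may be a path or an open binary file object.
    """
    np.savez_compressed(
        file,
        images=images,
        mz_ranges=np.asarray(mz_ranges, dtype=np.float64),
        rt_ranges=np.asarray(rt_ranges, dtype=np.float64),
    )


def load_blocks(file):
    """Load the arrays written by save_blocks."""
    with np.load(file) as data:
        return data["images"], data["mz_ranges"], data["rt_ranges"]
```

This replaces thousands of small image files with one file per raw data file, and the loaded `images` array can be sliced into batches in memory.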

asmirn1 commented 3 years ago

@jerrychen04

We are still working on using numpy arrays instead of images. Let's finish this work.
Filter out the blocks where the maximum intensity is below 1000.
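The intensity filter could be as small as the sketch below, where a block is a 2D list (or array rows) of raw intensities. The function name is hypothetical, and treating a block whose maximum is exactly 1000 as kept is an assumption.

```python
def keep_block(block, min_intensity=1000.0):
    """Return True if the block's maximum intensity reaches min_intensity."""
    highest = max((value for row in block for value in row), default=0.0)
    return highest >= min_intensity


def filter_blocks(blocks, min_intensity=1000.0):
    """Drop the blocks whose maximum intensity is below min_intensity."""
    return [block for block in blocks if keep_block(block, min_intensity)]
```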

du-lab / Trace

Investigate object/boundary/edge detection ML algorithms #5