cooperlab / ActiveLearning

Software and documentation from the active learning project on interactive classification.

Create new docker image #87

Closed slee172 closed 7 years ago

slee172 commented 7 years ago

Create a new docker image for the current HistomicsML.

choosehappy commented 7 years ago

great! i can confirm that this is indeed working as expected

at this stage i'm a bit confused about how to actually add a dataset into the system. i attempted to follow these directions [1] but am struggling to understand them.

if i have a cohort of tif images, how do i load them?

  1. https://raw.githubusercontent.com/cooperlab/ActiveLearning/master/doc/Creating_a_dataset_for_HistomicsML.txt
slee172 commented 7 years ago

Importing data into HistomicsML may not be trivial for users who are not familiar with the system, because it involves the database, the al_server daemon, and paths to several directories, and it can take users a lot of time. So, we are currently developing a data import tab on the main menu so that users can easily import their own data. We will also provide documentation for it.

choosehappy commented 7 years ago

awesome, i think that will definitely help people get more excited about using your software. i'm eager to try it!

cooperlab commented 7 years ago

@slee172 What is the ETA on the import tool?

slee172 commented 7 years ago

@cooperlab Almost done. But it will take a couple of days to add the other functions (generating data from features, analysis, graphs, etc.)

cooperlab commented 7 years ago

No hurry - just want to let @choosehappy know when to check back.

slee172 commented 7 years ago

The import tab has been added to HistomicsML in #90, but we need to add it to the docker image as well.

slee172 commented 7 years ago

Data import has been added to the docker image. Detailed descriptions are also available on readthedocs.

choosehappy commented 7 years ago

Cool, I got the latest docker version and can in fact see the import tab. Also, I found the documentation you were mentioning here:

https://histomicsml.readthedocs.io/en/latest/data-import.html#importing-data-using-samples

I guess what is still not obvious to me is how to create the actual files for import. I'm guessing one would need to write scripts to generate the csv, txt, and h5 files? I assume you already have something in-house that can do that?

slee172 commented 7 years ago

@choosehappy as we mentioned here, a dataset consists of slide information (.csv), feature information (.h5), boundary information (.txt), and the whole-slide image (.tif).

The slide information is formatted as "slide name,width in pixels,height in pixels,path to the pyramid on IIPServer,scale" for each slide. So, for example, you could put in a line like "TCGA-02-0010-01Z-00-DX4,32001,38474,/fastdata/pyramids/GBM/TCGA-02-0010-01Z-00-DX4.svs.dzi.tif,1".
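As a sketch, a line in that format could be generated with a short python script; the output filename (GBM-pyramids.csv) is just an assumed example here:

```python
# Sketch: write the slide-information CSV described above.
# The slide name, dimensions, pyramid path, and scale are the example
# values from this thread; substitute your own slides.
import csv

slides = [
    # (name, width_px, height_px, path to pyramid on IIPServer, scale)
    ("TCGA-02-0010-01Z-00-DX4", 32001, 38474,
     "/fastdata/pyramids/GBM/TCGA-02-0010-01Z-00-DX4.svs.dzi.tif", 1),
]

with open("GBM-pyramids.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in slides:
        writer.writerow(row)
```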

The boundary information is formatted as "slide name \t centroid x coordinate \t centroid y coordinate \t boundary points". So, for example, you could write a file (e.g. GBM-boundaries.txt) with lines like "TCGA-02-0010-01Z-00-DX4 2250.1 4043.0 2246,4043 2247,4043 2247,4042 2248,4042 2248,4040 2249,4040 2249,4039 2250,4039 2250,4038 2251,4038 2251,4037 2252,4037 2252,4035 2253,4035 2253,4033 2254,4033 2254,4032 2254,4033 2254,4033 2254,4039 2253,4039 2253,4040 2252,4040 2252,4043 2251,4043 2251,4047 2250,4047 2250,4050 2249,4050 2249,4052 2248,4052 2248,4053 2247,4053 2247,4054 2246,4054 2246,4055 2246,4054 2247,4054 2247,4050 2248,4050 2248,4044 2247,4044 2247,4043 2246,4043".
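Here is a minimal sketch of writing a record in that tab-separated format; the truncated polygon and the output filename are placeholders taken from the example above:

```python
# Sketch: write one boundary record in the tab-separated format above.
# Each record is: slide name, centroid x, centroid y, then the boundary
# polygon as space-separated "x,y" pairs.
records = [
    ("TCGA-02-0010-01Z-00-DX4", 2250.1, 4043.0,
     [(2246, 4043), (2247, 4043), (2247, 4042)]),  # truncated example polygon
]

with open("GBM-boundaries.txt", "w") as out:
    for name, cx, cy, points in records:
        boundary = " ".join(f"{x},{y}" for x, y in points)
        out.write(f"{name}\t{cx}\t{cy}\t{boundary}\n")
```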

The feature information includes the datasets "dataIdx, features, mean, slideIdx, slides, std_dev, x_centroid, y_centroid" under the root of the HDF5 file. We used C++ to create the HDF5-formatted file. This C++ code runs independently, so it doesn't affect the learning server, but it could cause issues when run in a different environment (e.g. with different C++ libraries). So, I would recommend writing your own script to create your feature data. We added the C++ file here, so you can look at it in more detail as an example of creating an HDF5 file.
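For reference, here is a minimal sketch of creating a file with the same dataset names using h5py instead of the C++ code. The shapes, dtypes, and the output filename (my-features.h5) are assumptions based on this thread, not the official format specification:

```python
# Hedged sketch: write an HDF5 feature file with the dataset names listed
# above. Values here are synthetic placeholders for illustration only.
import numpy as np
import h5py

n_objects, n_features = 4, 48  # the printed example vector below has 48 values
rng = np.random.default_rng(0)
features = rng.standard_normal((n_objects, n_features)).astype(np.float32)

with h5py.File("my-features.h5", "w") as f:
    f.create_dataset("features", data=features)
    f.create_dataset("mean", data=features.mean(axis=0))
    f.create_dataset("std_dev", data=features.std(axis=0))
    f.create_dataset("x_centroid", data=np.array([10.0, 20.0, 30.0, 40.0]))
    f.create_dataset("y_centroid", data=np.array([15.0, 25.0, 35.0, 45.0]))
    # one slide; every object points at slide index 0
    f.create_dataset("slideIdx", data=np.zeros(n_objects, dtype=np.int64))
    f.create_dataset("slides", data=np.array([b"TCGA-02-0010-01Z-00-DX4"]))
    f.create_dataset("dataIdx", data=np.array([0], dtype=np.int64))
```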

For the file format, you can find the detailed information here. In addition, you can see more details in the sample file (GBM-features.h5) on the docker container (e.g. histomicsml/hmlWeb). For example, after copying GBM-features.h5 to a location of your choice, run python to inspect the file contents (we used h5py, a python package for the HDF5 format). Here is a sample script to confirm what is inside the HDF5 file:

```python
import h5py

f = h5py.File("GBM-features.h5", "r")
for name in f:
    print(name)
```

This prints the dataset names under the root of the HDF5 file:

```
dataIdx
features
mean
slideIdx
slides
std_dev
x_centroid
y_centroid
```

As a further step, if you want to see the details, `print(f['features'][0])` prints the feature vector for the first object:

```
[ -7.30991781e-01 -8.36540878e-01 -1.07858682e+00 9.26770031e-01 -9.31272805e-01 -4.36136842e-01 -1.13033086e-01 5.28297901e-01 6.85962856e-01 5.07918596e-01 -5.27561486e-01 -7.48096228e-01 -6.84849143e-01 -8.79032671e-01 -1.41368553e-01 -3.24195564e-01 -4.50991303e-01 -1.32366025e+00 9.17324543e-01 8.36400129e-03 -2.92657673e-01 2.01028720e-01 -1.93680093e-01 8.68237793e-01 5.72155595e-01 3.29810083e-01 -3.63551527e-01 -2.87026823e-01 -8.47819634e-03 -4.55458522e-01 1.43787396e+00 5.24487114e+00 -9.62561846e-01 5.94001710e-01 3.57634330e+00 -2.94562435e+00 -9.18125820e+00 2.87391472e+01 -9.34123135e+00 2.55983505e+01 -2.99653459e+00 -1.17376029e-01 -5.40324259e+00 1.01094952e+01 5.87054205e+00 6.21094942e+00 -2.59355903e+00 -4.27142763e+00]
```

For the slide image, we used vips to build the pyramid.

cooperlab commented 7 years ago

@slee172 I think we should probably document the formats (particularly the .csv, .h5, and .txt) carefully on the readthedocs page. We can use some of the content from this file.

We can also refer people to the reference datasets within the Docker.

slee172 commented 7 years ago

Good idea. Added issue #100

choosehappy commented 7 years ago

From a new user perspective, I think you want to create a situation where people can very rapidly test out their own images easily.

For example, I have about 100 tif images containing nuclei, which should be ideal for your system and very easy to import and try out. At this point I’ve probably invested about 20 hours total attempting to get this to work, and while I understand that is the nature of research and development (and am happy to try things out), I don’t think everyone will fall into that camp.

Given that new users haven’t already invested heavily in your approach, if the first step towards them trying out their own similar data is to sort through a bunch of documentation, figure out file formats, how to load things into databases, what features to generate, etc, etc, it becomes increasingly likely that people won’t make it through the process and adopt your technology.

You probably want something as simple as: import_my_data.sh *.tif

And from there people can dissect that (working and debugged) workflow for use with their own non-similar datasets.

Just my two cents.

cooperlab commented 7 years ago

@choosehappy You can take a look at the related projects HistXtract and HistomicsTK for code related to generating image analysis data. Our tool is not intended for generating the image analysis data - we created this tool with the assumption that people would come with their own data - having performed image segmentation and feature extraction on the tif images. The reason for this assumption is that image analysis methods are very sensitive to variations in histology and almost always have to be tuned for a specific dataset. There is no one size fits all approach that can be distributed that would provide push-button analysis for all users.

As digital pathology matures we hope that these analyses will become more of a commodity, like next-gen sequencing. Our tool is an important step toward this goal, allowing labs who generate image analysis data to share it remotely with experts who have driving biological or medical questions that require machine learning analysis.

If you want to contact us by email then I would be happy to discuss your specific application.

choosehappy commented 7 years ago

I am of course in total agreement.

That said though, I still believe you’d get greater usage of your technology if a trivial pipeline was provided.

For example, I looked at HistXtract, which you mentioned, and it seems perfectly suited for this purpose. But unfortunately, it's not obvious to me how to take a set of tif images, use HistXtract to generate features and files in the appropriate format, and then import them into HistomicsML.

I think if you provided that pipeline, say including 3 small images as an example (maybe from https://histomicsml.readthedocs.io/en/latest/example.html#annotation-with-sample-dataset), then people could slightly modify it and copy-paste their own code in. It doesn't matter what the particular task or the features are (because, as you said, they are unique to everyone's use case), but a door-to-door working example is very valuable.

That version would have the lowest barrier to entry. I believe you likely already have all of that code, otherwise you wouldn't be able to use your own software :) It's just about documenting it and packaging it in a way that a 3rd party can quickly use, to get them interested and hooked.

cooperlab commented 7 years ago

@choosehappy We would like to also engage people who aren't already running image analysis algorithms - there is also a barrier here that I didn't mention which is computing. Users would need a large computer to analyze even a single whole-slide image. Unfortunately our system is designed specifically for whole-slide images, and does not work with smaller TIFs or images obtained from a traditional microscope (we do not want to compete with ImageJ or CellProfiler who have this space well-covered).

This is all valuable feedback and we will keep thinking about ways to make the tools more accessible. What I am really excited about is enabling science and there is still clearly a large gap that needs to be spanned to accomplish this. We may try to create an executable around HistXtract and put that into a Docker but there are some licensing issues due to Matlab. At the very least we will make it clear in the documentation of HistomicsML that this doesn't provide the segmentation and feature extraction, and point to tools that do.

choosehappy commented 7 years ago

sounds good!

let me know if you need a beta-tester, i have a pile of svs images as well :)