MicroMedIAn / PathAIA

Digital Pathology Analysis Tools
GNU General Public License v3.0

Start clear ml #28

Closed ArnaudAbreu closed 3 years ago

ArnaudAbreu commented 3 years ago

I think we also need common ways to build datasets and to make them as reliable as possible, since many ML mistakes in our projects come from how these structures are built.

Start the 'datasets' subpackage. 'DataSets' types are provided. Basically, we add a RefDataSet struct that is a tuple (x, y) where x is a list of samples and y is the corresponding list of labels. We provide a functional_api in pathaia.datasets that includes the following features:

I have a test script for the features described above, which I will add to PathAIA's testing procedures in another PR.
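To make the (x, y) structure described above concrete, here is a minimal sketch of such a dataset pair with a length check and a label-cleaning step. The names `RefDataSetSketch`, `make_ref_dataset` and `clean_sketch` are hypothetical and chosen for illustration only; this is not the code in pathaia.datasets, just one way the idea could look.

```python
from typing import List, Sequence, Tuple, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

# Hypothetical alias: a reference dataset is a pair (samples, labels).
RefDataSetSketch = Tuple[List[X], List[Y]]


def make_ref_dataset(x: Sequence[X], y: Sequence[Y]) -> Tuple[List[X], List[Y]]:
    """Build an (x, y) pair, refusing mismatched lengths."""
    if len(x) != len(y):
        raise ValueError(f"{len(x)} samples but {len(y)} labels")
    return list(x), list(y)


def clean_sketch(ds, dtype, rm):
    """Keep only pairs whose label has type `dtype` and is not listed in `rm`."""
    x, y = ds
    kept = [
        (xi, yi)
        for xi, yi in zip(x, y)
        if isinstance(yi, dtype) and yi not in rm
    ]
    return [xi for xi, _ in kept], [yi for _, yi in kept]
```

Checking lengths at construction time is one cheap way to catch the kind of dataset-building mistake mentioned above before it reaches training.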

Here is a code sample on how dataset decorators can be used to yield samples:

from pathaia.datasets import (
    clean, shuffle, balance, clip, batch, info, split
)
# First, split into training, testing.
# We do it first to keep the following processes independent.
@split({"training": 0.8, "test": 0.2})
# Then, we clear the subsets by removing wrong label types and values.
@clean(dtype=str, rm=["UNCLASSIFIED"])
# We can then balance the dataset (samples from minor classes are duplicated).
@balance
# Shuffle the sets of course.
@shuffle
# We can set a max number of samples to yield if the dataset is too big.
@clip(337)
# We can yield batches of data.
@batch(13)
# Finally, we define a very simple function to yield samples and labels
# and each decorator above will be applied before the yield.
# The order of the decorators is crucial: they are executed from top to bottom.
def fair_split_named_loop(ds):
    xds, yds = ds
    for x, y in zip(xds, yds):
        yield x, y

Of course, using the decorators is optional: you can call the corresponding functions directly instead. I guess it depends on the case, but in many of my scripts I find this kind of syntax helpful.
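For readers wondering how decorators of this kind can transform a dataset before the loop runs, here is a self-contained sketch. `clip_sketch` and `batch_sketch` are hypothetical stand-ins written for this example; they are not the pathaia.datasets implementations and their signatures are assumptions. Each wrapper transforms the (x, y) pair and then hands it to the function it wraps, so the topmost decorator runs first when the function is called.

```python
import functools


def clip_sketch(max_size):
    """Decorator factory: truncate the dataset to at most `max_size` samples."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(ds):
            x, y = ds
            return func((x[:max_size], y[:max_size]))
        return wrapper
    return decorator


def batch_sketch(batch_size):
    """Decorator factory: group samples and labels into consecutive batches."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(ds):
            x, y = ds
            starts = range(0, len(x), batch_size)
            bx = [x[i:i + batch_size] for i in starts]
            by = [y[i:i + batch_size] for i in starts]
            return func((bx, by))
        return wrapper
    return decorator


# The outermost (top) wrapper runs first: clip to 5 samples, then batch by 2.
@clip_sketch(5)
@batch_sketch(2)
def loop(ds):
    xds, yds = ds
    for x, y in zip(xds, yds):
        yield x, y


batches = list(loop((list(range(10)), list("abcdefghij"))))
# batches == [([0, 1], ['a', 'b']), ([2, 3], ['c', 'd']), ([4], ['e'])]
```

Swapping the two decorators would batch all ten samples first and then keep the first five batches, which is why the order matters.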

object_api will come with Cohort objects to wrap all these functions.

codecov[bot] commented 3 years ago

Codecov Report

Merging #28 (aa88e29) into master (7fc4b6f) will decrease coverage by 7.43%. The diff coverage is 1.80%.

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
- Coverage   38.15%   30.71%   -7.44%     
==========================================
  Files          15       18       +3     
  Lines         865     1084     +219     
==========================================
+ Hits          330      333       +3     
- Misses        535      751     +216     
| Flag | Coverage Δ |
|---|---|
| unittests | 30.71% <1.80%> (-7.44%) :arrow_down: |

| Impacted Files | Coverage Δ |
|---|---|
| pathaia/datasets/__init__.py | 0.00% <0.00%> (ø) |
| pathaia/datasets/data.py | 0.00% <ø> (ø) |
| pathaia/datasets/errors.py | 0.00% <0.00%> (ø) |
| pathaia/datasets/functional_api.py | 0.00% <0.00%> (ø) |
| pathaia/patches/functional_api.py | 45.84% <ø> (ø) |
| pathaia/util/management.py | 0.00% <0.00%> (ø) |
| pathaia/util/types.py | 97.56% <100.00%> (+0.09%) :arrow_up: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 7fc4b6f...aa88e29.