andropar closed this pull request 2 years ago
@andropar Please add documentation about this new feature to the README.md and our official docs. Looks good so far :) The default option should remain what we had before (i.e., extracting all features without partitioning them into subsets).
Will do after Philipp has revamped the docs 👍
Base: 80.75% // Head: 79.94% // Decreases project coverage by 0.81% :warning:
Coverage data is based on head (6bec534) compared to base (b4363f7). Patch coverage: 63.41% of modified lines in the pull request are covered.
No specific reason, what do you think the default should be?
I am not sure. I was thinking of a scenario where there are not even 100 batches. Let's assume the batch size is set to 32 or 64; a default of 100 would then mean the dataset contains at least 3,200 or 6,400 images, which is not tiny. For smaller datasets, however, it's probably fine to skip the low-memory option entirely and simply use the default. What do you think?
I thought about this too; it's a trade-off between batch size, number of batches, number of features, and feature size. Maybe we can use some kind of heuristic as a default, e.g., assuming 8 GB of available RAM and then approximating how much would fit into memory?
I like that idea; it sounds reasonable to me. We can probably even assume 16 GB of available RAM, since most local machines these days have roughly 16-32 GB. The only problem with that approach might be the variability of activation sizes across layers (early layers consume substantially more space than later ones). But as you said, it's a trade-off, and we need to make a choice. I think this is a fine one.
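A minimal sketch of what such a heuristic could look like (the function name, the flat per-feature size model, and the 16 GB budget are illustrative assumptions, not anything implemented in this PR):

```python
import numpy as np

def default_save_every(batch_size: int, feature_dim: int,
                       ram_budget_gb: float = 16.0,
                       dtype=np.float32) -> int:
    """Approximate how many batches of features fit into the RAM budget.

    Illustrative heuristic only: it assumes a fixed per-feature size and
    ignores the variability of activation sizes across layers.
    """
    bytes_per_batch = batch_size * feature_dim * np.dtype(dtype).itemsize
    budget_bytes = ram_budget_gb * 1024 ** 3
    # Never return 0; always keep at least one batch in memory.
    return max(1, int(budget_bytes // bytes_per_batch))

# e.g. batch size 64, 4096-dim float32 features:
# 16 GiB / (64 * 4096 * 4 B) = 16384 batches fit before a flush is needed.
print(default_save_every(batch_size=64, feature_dim=4096))
```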
`extractor.extract_features()` now takes two extra arguments, `output_dir` and `save_every`. If `output_dir` is set, features will be written to disk every `save_every` batches (default: 100).
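A hedged usage sketch: only `output_dir` and `save_every` come from this PR's description; the surrounding setup follows thingsvision's documented pattern and may differ from the version this PR targets, and the model, layer, and paths are placeholders.

```python
from thingsvision import get_extractor
from thingsvision.utils.data import ImageDataset, DataLoader

# Standard extractor/dataset setup (adjust to your thingsvision version).
extractor = get_extractor(
    model_name='vgg16', source='torchvision', device='cpu', pretrained=True
)
dataset = ImageDataset(
    root='path/to/images', out_path='path/to/features',
    backend=extractor.get_backend(),
    transforms=extractor.get_transformations(),
)
batches = DataLoader(dataset, batch_size=64, backend=extractor.get_backend())

features = extractor.extract_features(
    batches=batches,
    module_name='features.23',      # example layer name
    flatten_acts=True,
    output_dir='path/to/features',  # new in this PR: enables writing to disk
    save_every=100,                 # new in this PR: flush every 100 batches
)
```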