PatBall1 / detectree2

Python package for automatic tree crown delineation based on the Detectron2 implementation of Mask R-CNN
https://patball1.github.io/detectree2/
MIT License
148 stars, 35 forks

Unexpected Behavior of the to_traintest_folders Function with test_frac=0 and folds=2 Parameters #108

Closed (mpcabete closed this 1 year ago)

mpcabete commented 1 year ago

I am encountering unexpected behavior while using the to_traintest_folders function with the parameters test_frac=0 and folds=2. According to my understanding, when these parameters are set, the function should split all the .geojson files in the tiles directory into the training folds and a separate test fold. However, when I ran the function with 9 .geojson files in the tiles folder, it placed only one file in fold1 and one file in the test directory.

Steps to reproduce:

1. Set test_frac=0 and folds=2 as parameters in the to_traintest_folders function.
2. Provide a directory with multiple .geojson files.
3. Run the function.

Expected behavior: I expect the function to split all the .geojson files in the tiles directory into two training folds and a separate test fold.

Actual behavior: The function only places one .geojson file in fold1 and one file in the test directory, leaving the remaining files unassigned.

Please let me know if I have misunderstood the expected behavior of the function, as this will help me pinpoint any potential issues in the code. Alternatively, if there is any guidance or solution you can provide, it would be greatly appreciated. Thank you for your assistance.

mpcabete commented 1 year ago

Upon further investigation, I have identified the source of the unexpected behavior in the to_traintest_folders function with the test_frac=0 parameter. In the tutorial, there is a note suggesting that removing overlapping tiles can be disabled by setting test_frac to 0:

The to_traintest_folders function automatically removes training/validation geojsons that overlap with test tiles, ensuring strict spatial separation of the test data. However, this can remove a significant proportion of the data available to train on so if validation accuracy is a sufficient test of model performance test_frac can be set to 0. Alternatively, just set a test_frac value that is smaller than you might otherwise have put.

However, upon examining the code within the to_traintest_folders function, I noticed that the line if i <= len(file_roots) * test_frac is responsible for selecting the test tiles. With test_frac=0 the right-hand side is 0, so when i=0 the condition 0 <= 0 still holds. The first tile is therefore selected as a test tile, and any tiles overlapping it are subsequently removed using the is_overlapping_box function.

This behavior leads to the issue of only one .geojson file being placed in fold1 and one file in the test directory, as most of my tiles overlap.
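The effect described above can be reproduced with a small standalone sketch. It does not call detectree2: split_tiles and boxes_overlap are hypothetical stand-ins for the selection loop and for is_overlapping_box, tiles are represented as plain (minx, miny, maxx, maxy) tuples, and shuffling and the division into folds are left out. It assumes, as in the description above, that the loop index i starts at 0.

```python
def boxes_overlap(a, b):
    """Return True if two axis-aligned boxes (minx, miny, maxx, maxy) overlap.

    Hypothetical stand-in for the library's overlap check; boxes that only
    share an edge are treated as non-overlapping.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def split_tiles(boxes, test_frac):
    """Simplified sketch of the selection logic described in this comment."""
    test, train = [], []
    for i, box in enumerate(boxes):
        # Inclusive comparison: with test_frac=0 this is 0 <= 0 for i == 0,
        # so the first tile is still sent to the test set.
        if i <= len(boxes) * test_frac:
            test.append(box)
        else:
            train.append(box)
    # Drop training tiles that overlap any test tile.
    train = [b for b in train if not any(boxes_overlap(b, t) for t in test)]
    return test, train


# Nine heavily overlapping tiles: 40 units wide, shifted by 5 units each.
boxes = [(x, 0, x + 40, 40) for x in range(0, 45, 5)]
test, train = split_tiles(boxes, test_frac=0)

print(len(test), len(train))  # 1 1
```

In this toy layout only the last tile avoids the single test tile, so seven of the nine tiles are discarded, which mirrors the one-train, one-test result reported above.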

PatBall1 commented 1 year ago

Hi @mpcabete thanks for identifying and digging into this issue. As you suggest, I think it would be resolved by changing if i <= len(file_roots) * test_frac to if i < len(file_roots) * test_frac. It might also be worth us having an option to allow test tiles to have some overlaps with the training/validation tiles to help when there is limited training data available. What kind of situation are you aiming to train/predict on? We may be able to share some additional helpful training data or a pre-trained model that hasn't yet been uploaded.
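To illustrate the difference between the two comparisons, here is a minimal sketch that only counts how many tiles each variant would send to the test set. count_test_tiles and its inclusive flag are hypothetical names used for illustration, not part of detectree2, and the sketch assumes a zero-based index as in the discussion above.

```python
def count_test_tiles(n_files, test_frac, inclusive):
    """Count indices selected as test tiles under either comparison.

    inclusive=True mimics `i <= n_files * test_frac` (the original line);
    inclusive=False mimics `i < n_files * test_frac` (the proposed change).
    """
    threshold = n_files * test_frac
    if inclusive:
        return sum(1 for i in range(n_files) if i <= threshold)
    return sum(1 for i in range(n_files) if i < threshold)


# With test_frac=0 the inclusive form still selects one tile.
print(count_test_tiles(9, 0, inclusive=True))    # 1
print(count_test_tiles(9, 0, inclusive=False))   # 0

# When the threshold is a whole number, the inclusive form selects one extra tile.
print(count_test_tiles(10, 0.5, inclusive=True))   # 6
print(count_test_tiles(10, 0.5, inclusive=False))  # 5
```

With the strict comparison, test_frac=0 yields an empty test set, so no training tiles would be removed for overlapping it.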

mpcabete commented 1 year ago

Hi @PatBall1,

Thank you for your response.

Regarding my project, I am working on training and prediction for tree crown delineation within an ecological station located in the Atlantic Forest region of Brazil. I would appreciate it if you could share the training data and the pre-trained model you mentioned.

Thank you for your attention and consideration.

Best regards, Mateus

PatBall1 commented 1 year ago

@mpcabete I have made the proposed changes so you should be able to work with your data. I have added some notes on the new behaviour here: https://patball1.github.io/detectree2/tutorial.html#preparing-data Please let me know if you have any issues with it. Please feel free to send me an email on ball.jgc@gmail.com to discuss the training data / pre-trained model.