chenyangkang / stemflow

A Python Package for Adaptive Spatio-Temporal Exploratory Model (AdaSTEM)
https://chenyangkang.github.io/stemflow/
MIT License

[REVIEW] JOSS feedback #2

Closed by earth-chris 7 months ago

earth-chris commented 8 months ago

Hi @chenyangkang,

I'm providing my review of stemflow here in this issue. I don't have many detailed software issues, since the package installs correctly and works as demonstrated in the documentation, for what I've tested. I have a few other comments that I believe will improve user experience.

First, thank you for preparing this package. It is a valuable contribution to the community. The goals of my comments are to make it easier for others to use this package with their own data. It is very easy to run the examples provided using pre-processed data, but there appears to be a series of assumptions regarding how these data should be formatted that are not clear.

Since this package is designed around spatial and temporal data, it would help to clarify how time and space are encoded in the model. Do input dataframes require a DOY column? Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)? I see latitude and longitude encoded in the mini test data - is that the only supported CRS? I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?

Next, I think more guidance regarding feature data could be provided. I was assuming most of the input covariates would be datasets that are temporally resolved at a similar scale as the abundance data (e.g. daily NDVI). But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates? Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.

The example notebooks provide clear usage examples, but there is little to no explanation of why certain routines are performed. There are titles for groups of code, but little contextual information. In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? What would turning these numbers up and down do? What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode? Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.

There are lots of great features in this package and in the documentation that I don't feel like I understand as well as I would like to. I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable. You mentioned using other base models from sklearn in the manuscript - when would this be advantageous? You also provide great examples for comparing learning curves and optimizing stixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize stixel size to minimize overfitting, for example?).

Overall, I think this is a great package and will make a valuable contribution to the growing ecosystem of python biogeography tools. I mostly recommend providing more guidance to users for how to best use this valuable resource.

Cheers,

chenyangkang commented 8 months ago

Hi Chris,

Thank you for your valuable comments. I agree with most of the points you raised. Here are the replies and solutions:

  1. About how time and space are encoded:

Do input dataframes require a DOY column?

No. The spatial and temporal indexing columns are flexible, and their names can be passed as arguments (e.g., `Spatio1`, `Spatio2`, `Temporal1`).

Does anything change if data are provided at other temporal scales (weekly, monthly, yearly, etc.)?

No. The scale of the indexing is flexible and user-defined.
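As a hedged illustration of this flexibility (the data and column names below are hypothetical, not from the stemflow docs), a user could derive a weekly index with pandas and pass that column name as the temporal index, e.g. via the `Temporal1` argument shown in the usage example later in this thread:

```python
import pandas as pd

# Hypothetical occurrence records with calendar dates (illustrative data)
df = pd.DataFrame({
    "longitude": [-83.1, -82.5, -80.0],
    "latitude": [42.3, 41.9, 40.1],
    "date": pd.to_datetime(["2020-01-05", "2020-03-20", "2020-11-30"]),
})

# Derive a weekly temporal index (ISO weeks 1-53) instead of DOY;
# this column name would then be passed as the temporal index argument
df["week"] = df["date"].dt.isocalendar().week.astype(int)

print(df["week"].tolist())  # [1, 12, 49]
```

Any monotone numeric index (week, month, year) would work the same way, as long as the window parameters are set on the same scale.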

I see latitude and longitude encoded in the mini test data - is that the only supported CRS?

No. Any coordinate indexing system can be passed to the AdaSTEM framework. If users wish to use another CRS, they can transform their coordinates beforehand.
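For example, coordinates stored in Web Mercator (EPSG:3857) could be converted back to longitude/latitude before fitting. In practice one would use a library such as pyproj for this; the sketch below just implements the inverse Web Mercator formulas with the standard library, purely as an illustration of "transform beforehand":

```python
import math

R = 6378137.0  # WGS84 equatorial radius used by Web Mercator, in meters

def mercator_to_lonlat(x: float, y: float) -> tuple:
    """Inverse Web Mercator (EPSG:3857) -> (lon, lat) in degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(math.atan(math.sinh(y / R)))
    return lon, lat

# Sanity checks: the projection origin maps back to (0, 0),
# and x = R * pi maps to longitude 180
print(mercator_to_lonlat(0.0, 0.0))           # (0.0, 0.0)
print(mercator_to_lonlat(R * math.pi, 0.0))   # (~180.0, 0.0)
```

The resulting longitude/latitude columns can then be used as the spatial indices.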

I see that geopandas is a dependency - does the model class support passing a GeoDataFrame, or do you need to explicitly encode column names?

This was a mistake. geopandas is not a dependency of stemflow; I have removed it from the requirements for both PyPI and conda. Currently we only support a pandas DataFrame as training-data input (there is a type check). A GeoDataFrame can be transformed to a `pd.core.frame.DataFrame` beforehand.
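A minimal sketch of the expected input shape (the column names and values here are hypothetical; the commented GeoDataFrame conversion assumes point geometries and is not run here, so geopandas is not needed):

```python
import pandas as pd

# stemflow expects a plain pandas DataFrame with explicit coordinate columns.
# If your data live in a GeoDataFrame `gdf` with point geometries, you could
# flatten it first (hypothetical sketch, requires geopandas):
#   gdf["longitude"] = gdf.geometry.x
#   gdf["latitude"] = gdf.geometry.y
#   df = pd.DataFrame(gdf.drop(columns="geometry"))

df = pd.DataFrame({
    "longitude": [-83.1, -82.5],
    "latitude": [42.3, 41.9],
    "DOY": [120, 152],
    "count": [3, 0],
})

# This is the type the input check mentioned above looks for
assert isinstance(df, pd.core.frame.DataFrame)
```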

All these issues are documented in the new documentation pages Tips for spatiotemporal indexing and Tips for data types, and are also listed as features on the home page (section Model and data).

Relevant PR(s): #14

chenyangkang commented 8 months ago
  2. Thank you for your advice on more guidance for feature data. Here are the replies and solutions:

But it looks like the mini_data example is nearly all static features, with DOY being the only dynamic variable. Is this how other datasets are expected to be formatted? Can users provide a combination of static and dynamic covariates?

This is a great question. You are right: the example datasets are formatted with static features except for DOY. This was done for specific reasons, which are discussed in the additional documentation Tips for data types. However, dynamic features are absolutely suitable for training, as is a combination of static and dynamic ones.

Are categorical features supported? More clarification regarding best practices for how to extract and format covariate data would help a lot.

Yes, categorical features are supported. Feature formatting is covered in Tips for data types.
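As a generic preprocessing sketch (the columns here are hypothetical, and whether stemflow accepts raw categorical dtypes directly is covered in Tips for data types), one common way to format a categorical covariate is one-hot encoding with pandas:

```python
import pandas as pd

# Hypothetical covariates with one categorical column ("landcover")
df = pd.DataFrame({
    "elevation": [210.0, 95.5, 300.2],
    "landcover": ["forest", "urban", "forest"],
})

# One-hot encode the categorical column into numeric indicator columns
df = pd.get_dummies(df, columns=["landcover"])

print(sorted(df.columns))  # ['elevation', 'landcover_forest', 'landcover_urban']
```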

Relevant PR(s): #14

chenyangkang commented 8 months ago
  3. Thank you for your advice on more detailed annotation of the notebooks. I have added additional comments and annotations to all notebooks. For example, please see here.

In the intro notebook, for example, why were the parameters Spatio_blocks_count=50, Temporal_blocks_count=50 selected? what would turning these numbers up and down do?

Thank you for pointing out this specific issue. We explain the choice of spatiotemporal train-test-split parameters in the AdaSTEM demo: they approximately match the spatiotemporal scale of bird migration.

What are the grid_len_{}_{}_threshold parameters, and what do those defaults encode?

Thank you for pointing out this confusion. To help users better understand the parameters, we added inline comments to the usage example on the home page and to the AdaSTEM demo, such as:

model = AdaSTEMRegressor(
    base_model=Hurdle(
        classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
        regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
    ),                                  # hurdle model for zero-inflated problems (e.g., counts)
    save_gridding_plot=True,
    ensemble_fold=10,                   # data are modeled 10 times, each time with jitter and rotation in the Quadtree algorithm
    min_ensemble_required=7,            # only points covered by > 7 stixels are predicted
    grid_len_upper_threshold=25,        # force splitting if the longitudinal edge of a grid exceeds 25
    grid_len_lower_threshold=5,         # stop splitting if the longitudinal edge of a grid falls below 5
    temporal_start=1,                   # the next 4 params define the temporal sliding window
    temporal_end=366,
    temporal_step=20,
    temporal_bin_interval=50,
    points_lower_threshold=50,          # only stixels with more than 50 samples are trained
    Spatio1='longitude',                # Spatio1/Spatio2 name the spatial coordinate
    Spatio2='latitude',                 #   columns in the input dataframe
    Temporal1='DOY',                    # Temporal1 names the temporal index column
    use_temporal_to_train=True,         # in each stixel, whether 'DOY' should be a predictor
    njobs=1
)


Tracing back to the original function is often still challenging, as there are many parameters with slightly varying names that are still hard to understand. As a user I would appreciate more narrative clarification.

I agree with you. For first-time users, some parameter settings can be overwhelming. I hope the additional inline annotations, documentation, and notebooks help clarify them.

Relevant PR(s): #16, #17, #18

chenyangkang commented 8 months ago
  4. Thank you for your comments on adding more explanation about functionality.

I'm not familiar with the Hurdle modeling approach, which seems great and appears foundational to AdaSTEM - tips on best practices here would be valuable.

Definitely true. I added more information about which tasks stemflow can handle, and especially what a hurdle model is, in the new document Tips for different tasks. We also mention the function repertoire on the home page.
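To convey the general idea (this is a toy numpy sketch of the two-stage hurdle decomposition, not stemflow's Hurdle class): stage 1 models whether a count is zero vs. positive, and stage 2 models the magnitude using only the positive records.

```python
import numpy as np

# Simulated zero-inflated counts: ~70% structural zeros, otherwise Poisson(5)
rng = np.random.default_rng(42)
y = np.where(rng.random(1000) < 0.7, 0, rng.poisson(5, 1000))

occurrence_rate = (y > 0).mean()   # stage-1 "model": P(y > 0)
positive_mean = y[y > 0].mean()    # stage-2 "model": E[y | y > 0]

# The hurdle prediction multiplies the two stages; on the training data
# this product equals the raw mean exactly (an identity of the decomposition)
expected_count = occurrence_rate * positive_mean
print(round(expected_count, 3), round(y.mean(), 3))
```

In stemflow, each stage is a fitted base model (e.g., an XGBoost classifier and regressor, as in the usage example above) rather than these plug-in estimates.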

You mentioned using other base models from sklearn in the manuscript - when would this be advantageous?

Thank you for your question. We added an additional analysis and a bump chart comparing the performance of different base models in Base model choices. Basically, there is a trade-off among performance, speed, and memory usage. stemflow is a bagging modeling framework, which makes it time- and memory-consuming, so we want the best trade-off between performance and resource consumption.

You also provide great examples for comparing learning curves and optimizing stixel sizes, but these notebooks are just code blocks with some plots and little interpretation. I found myself wanting more guidance for how to interpret these results, or even guidance on why such optimization is important (do you optimize stixel size to minimize overfitting, for example?).

Thank you for your suggestions. I added additional charts, comments, and conclusions in the section Optimizing stixel size, addressing why such optimization is important and how to optimize the stixels. I hope this helps users understand the parameter choices.

Relevant PR(s): #15, #16, #17, #18

chenyangkang commented 8 months ago

Still exploring the possibility of using geopandas as a dependency: #30

chenyangkang commented 7 months ago

Not using geo-features for reasons mentioned in #30