SPI-Birds / pipelines

Pipelines for generating a standard data format for bird data
2 stars 6 forks source link

Subsetting function #133

Closed ChloeRN closed 3 years ago

ChloeRN commented 4 years ago

I made the following updates to the previously existing subset_datarqst() function:

Further, I made some basic tests for the function. Suggestions on what else to add to the tests are very welcome.

The pull request also contains some changes to make run_pipelines() work smoothly on Mac. Specifically, I changed the naming of paths, and added functionality to skip pipelines for populations with Access raw data when OS = Mac.

ChloeRN commented 4 years ago

It would be good to test the function on the "complete" set of standard data. I was only able to test it on standard data for which pipelines currently run on Mac. This means the following were not included in my workflow/tests:

(In the tests, I further dropped BOS, PEE, and WYT out of purely practical reasons, i.e. they take very long to run).

ChloeRN commented 4 years ago

@LiamDBailey and @StefanVriend thanks for the constructive feedback, I have now taken into account almost all of it, so requesting re-review for you to check the improvements I made.

@StefanVriend requested to add some progress bars to the subsetting function to visualize the progress. At the moment, I keep getting warning messages that progress bars are now deprecated. So not quite sure if and how (and where) to include progress bars in the function at this point. Opinions?

StefanVriend commented 4 years ago

The function progess::progress_bar works quite fine! You can find an example in check_values_brood in "brood_check.R".

When I run subset_datarqst(PopID = "HOG") and then select the location of the .RDS file, I still run into the following error: Error in eval(lhs, parent, parent) : object 'standard_data' not found. Do I do something wrong?

ChloeRN commented 4 years ago

@StefanVriend I double-checked the issue with reading in the standard data now, and it looks like you were indeed correct. Now changed the code to what you suggested earlier. Let me know if it runs fine now.

StefanVriend commented 4 years ago

@ChloeRN Yes :), it worked now. However, it took 55 minutes. Do you know why it might have taken so long?

ChloeRN commented 3 years ago

@StefanVriend I had a look into it, and what takes longest is the part of the function that goes through each individual to determine if its Species is conflicting, and if so, whether the observed species are relevant for the output. I have made two changes here: 1) Added a filter so that the function only checks "relevance" of observed species if the species is recorded as conflicting in individual data (this should drastically reduce the runtime). 2) Added a progress bar - as you suggested - to this part of the function (although after changing 1), it's so short that it does not even display...)

Additionally, I've added some messages into the function that give the user an overview over the steps.

Please try again, and let me know if it works better/faster now.

StefanVriend commented 3 years ago

Looks good. 9 min and almost 14 seconds for HOG is a big improvement compared to the 45-minute running time.

If we all agree that Rows should be reset to 1, that piece of code should indeed be added after each table is arranged as you suggested above.

ChloeRN commented 3 years ago

Made final adjustments following Liam's suggestions and everything seems to work smoothly now. Ran tests and RCMD checks, all were passed. Ready to merge now.

LiamDBailey commented 3 years ago

@ChloeRN I'm having some problem getting the tests and checks to work on my side. Could you show me the output of the tests and checks to compare?