International-Soil-Radiocarbon-Database / ISRaD

Repository for the development and release of ISRaD data and tools
https://international-soil-radiocarbon-database.github.io/ISRaD/

Improve build fx efficiency #209

Closed: jb388 closed this issue 4 years ago

jb388 commented 4 years ago

The build function runs QAQC on every file in the data directory, which is inefficient because most templates have already passed. Proposed solution: match the QAQC file names against the template names (and verify that each QAQC report shows a pass), then run QAQC only on new templates.
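The name-matching idea can be sketched in a few lines of Python. This is an illustration only, not the ISRaD implementation (the project's code is in R): the `QAQC_` report prefix, the file extensions, and the function name `templates_needing_qaqc` are all assumptions, and the caller is assumed to pass in only reports that recorded a pass.

```python
from pathlib import PurePath


def templates_needing_qaqc(template_files, passed_reports, prefix="QAQC_"):
    """Return the template files that have no matching passing QAQC report.

    Assumption: a report for template "X.xlsx" is named "QAQC_X.<ext>".
    """
    # Template names recovered from the report names (prefix stripped).
    checked = {
        PurePath(report).stem[len(prefix):]
        for report in passed_reports
        if PurePath(report).stem.startswith(prefix)
    }
    # Keep only templates whose name has no passing report.
    return [t for t in template_files if PurePath(t).stem not in checked]


# Hypothetical file names for illustration.
todo = templates_needing_qaqc(
    ["Caner_2003.xlsx", "New_2019.xlsx"],
    ["QAQC_Caner_2003.txt"],
)
print(todo)  # ['New_2019.xlsx']
```

Matching on names alone cannot detect a template that was edited after its report was written.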

alkalifly commented 4 years ago

That would be great. But it should also check that the modification time of the QAQC file is later than that of the template file. That way, if an existing template gets updated, the new version will get checked.
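A minimal sketch of the modification-time check, in Python for illustration (the project itself is in R). The function name `qaqc_is_current` and the one-report-per-template layout are assumptions; `os.path.getmtime` is the standard-library call for a file's last-modified time.

```python
import os


def qaqc_is_current(template_path, report_path):
    """True only if a QAQC report exists and is newer than its template."""
    if not os.path.exists(report_path):
        return False  # never checked
    # A template modified after its report was written must be re-checked.
    return os.path.getmtime(report_path) > os.path.getmtime(template_path)
```

Modification times can change without the content changing (for example after a fresh clone or checkout), so this test may trigger re-checks that are not strictly needed.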

(Email reply to Jeff B's comment of Dec 21, 2019, 08:58.)

coreylawrence commented 4 years ago

I second Paul's comment. Templates that had already passed QA/QC have sometimes been updated in the repository. Of course, any changes should go through the QA/QC process before being implemented, but it would be best to double-check.

jb388 commented 4 years ago

I spent some time working on this today and made some initial progress. Changes to 'compile.R' are on the dev branch.

Specifically, the function now 1) checks whether the template is already in the database and, 2) if it is, checks whether the data are identical. QA/QC is run only if either 1) or 2) is FALSE.

@alkalifly I decided to use the actual data as the test, rather than the modification date.
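The two-step test described above might look roughly like this. It is a Python sketch for illustration, not the code in 'compile.R': the name `needs_qaqc` and the representation of the compiled database as a dict mapping entry names to lists of row dicts are assumptions.

```python
def needs_qaqc(entry_name, template_rows, compiled):
    """Decide whether a template must be run through QA/QC.

    compiled: hypothetical dict mapping entry name -> list of row dicts
    already present in the compiled database.
    """
    # 1) Is the template in the database already?
    if entry_name not in compiled:
        return True
    # 2) If it is, are the data all the same?
    return compiled[entry_name] != template_rows


# Hypothetical rows for illustration.
db = {"Caner_2003": [{"frc_name": "clay"}]}
print(needs_qaqc("Caner_2003", [{"frc_name": "clay"}], db))  # False
print(needs_qaqc("Caner_2003", [{"frc_name": "silt"}], db))  # True
print(needs_qaqc("New_2019", [{"frc_name": "clay"}], db))    # True
```

Comparing the data rather than modification dates means a template is re-checked only when its content actually differs from what was compiled.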

The primary efficiency improvement remaining is in the ISRaD_extra process. As with compiling the database and running QA/QC, I'd like to run the ISRaD_extra functions only on the new data. That's a project for another time, however...

jb388 commented 4 years ago

Update: new compile function is now on master.

While reviewing the performance of the new function again today, I caught an instance where strange characters had been inserted in the ISRaD_list.xlsx file: e.g., the Caner_2003 template has the character "<" in the frc_name field, which was parsed as "&lt;".

So, I guess another advantage of the new function is that it provides a mechanism to catch these sorts of errors, i.e. discrepancies between data as entered in the template and as rendered in the compiled data product.
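A field-level comparison is one way such discrepancies could be reported. The Python sketch below is illustrative only: the function name `find_discrepancies`, the dict-per-row representation, and the sample values are assumptions, not the actual Caner_2003 data or the ISRaD code.

```python
def find_discrepancies(template_row, compiled_row):
    """Map each differing field to (template value, compiled value)."""
    fields = sorted(set(template_row) | set(compiled_row))
    return {
        f: (template_row.get(f), compiled_row.get(f))
        for f in fields
        if template_row.get(f) != compiled_row.get(f)
    }


# Made-up values: a character lost between template and compiled product.
template = {"entry_name": "Example_2000", "frc_name": "<2 um"}
compiled = {"entry_name": "Example_2000", "frc_name": "2 um"}
print(find_discrepancies(template, compiled))
# {'frc_name': ('<2 um', '2 um')}
```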