NFDI4BIOIMAGE / training

https://nfdi4bioimage.github.io/training
Creative Commons Attribution 4.0 International

Data cleaning #99

Open haesleinhuepf opened 5 days ago

haesleinhuepf commented 5 days ago

I recently noticed that our data needs a little cleaning. For example, the same license is named differently in different entries.

We should also check other entries such as tags if there might be similar issues.

It would be great if we could standardize these metadata entries, ideally automatically. The notebook here might help, as it summarizes the existing licenses:

https://github.com/NFDI4BIOIMAGE/training/blob/main/scripts/license_statistics.ipynb
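
To see which spellings of a license exist, one can count the distinct values of the license field. The following is only a minimal sketch of that idea, not the content of the notebook: the `entries` list, the `name` field and the `count_license_spellings` helper are made up for illustration, and it assumes each entry carries at most one license string.

```python
from collections import Counter

# Hypothetical entries; the real records and their field names may differ.
entries = [
    {"name": "A", "license": "CC BY 4.0"},
    {"name": "B", "license": "cc-by-4.0"},
    {"name": "C", "license": "Creative Commons Attribution 4.0 International"},
    {"name": "D", "license": "cc-by-4.0"},
    {"name": "E"},  # entry without a license field
]


def count_license_spellings(entries):
    """Count how often each exact license spelling occurs."""
    counts = Counter()
    for entry in entries:
        counts[entry.get("license", "<missing>")] += 1
    return counts


license_counts = count_license_spellings(entries)

# Print the spellings, most frequent first.
for spelling, n in license_counts.most_common():
    print(f"{n:3d}  {spelling}")
```

Spellings that differ only in case, spacing or punctuation then show up as separate rows, which makes the candidates for cleaning easy to spot.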

@SeverusYixin would you mind looking into this? Thank you!

SeverusYixin commented 4 days ago

Hmm, I didn't quite get the requirement here.

E.g. this is the first one, am I right? authors: Robert Haase et al. license:

and second is this:

3rd:

These three all refer to the same license, so the requirement is that they should all use the same name, e.g. cc-by-4.0. Am I right? What is the "primary key" for them? I mean, how can I make sure that Creative Commons Attribution 4.0 International, CC BY 4.0 and other variants are the same license? Should I check this based on https://github.com/NFDI4BIOIMAGE/training/blob/main/scripts/license_statistics.ipynb? So we need something like an app that can automatically correct them to the same license, am I right?

haesleinhuepf commented 3 days ago
  • CC BY 4.0
  • license: cc-by-4.0
  • license: Creative Commons Attribution 4.0 International

> These three all refer to the same license, so the requirement is that they should all use the same name, e.g. cc-by-4.0. Am I right? [...] How can I make sure that Creative Commons Attribution 4.0 International, CC BY 4.0 and other variants are the same license?

Yes, exactly! I also don't know yet how to ensure these are the same license, but I presume it is possible to program this. First, we need to find out what would be a good identifier for a license. Maybe the SPDX license ID is a starting point? See https://spdx.org/licenses/ or https://opensource.org/license. Please explore a bit what kind of standards exist.
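
To make the "primary key" idea concrete, here is a minimal sketch that maps the three spellings from this thread to one identifier. It assumes the SPDX ID `CC-BY-4.0` is used as the canonical form; the `ALIASES` table and the `to_spdx` helper are hypothetical and would need to be extended for other licenses.

```python
import re

# Hypothetical alias table: canonical SPDX ID -> spellings seen in the data.
ALIASES = {
    "CC-BY-4.0": [
        "CC BY 4.0",
        "cc-by-4.0",
        "Creative Commons Attribution 4.0 International",
    ],
}


def _key(text):
    """Reduce a license string to lowercase letters and digits only."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


# Lookup from reduced spelling to SPDX ID; the ID itself is also accepted.
LOOKUP = {
    _key(alias): spdx
    for spdx, aliases in ALIASES.items()
    for alias in [spdx, *aliases]
}


def to_spdx(license_text):
    """Return the SPDX ID for a known spelling, or None if it is unknown."""
    return LOOKUP.get(_key(license_text))


for spelling in ["CC BY 4.0", "cc-by-4.0",
                 "Creative Commons Attribution 4.0 International"]:
    print(spelling, "->", to_spdx(spelling))
```

Ignoring case, spaces and punctuation catches the short variants automatically, while the full license name still needs an explicit alias. Unknown spellings return `None` so that they can be reviewed by hand rather than guessed.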

> So we need something like an app that can automatically correct them to the same license, am I right?

Yes, that would be cool. We could build this into the GitHub workflow, for example like we do here and here to check for duplicate entries in our database. Or this could be a standalone app / notebook that we run from time to time to check the consistency of our data.

Thank you!
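
A consistency check of this kind could look roughly like the sketch below. It is only an illustration under assumptions: the `entries` list, the `name` field, the set of allowed IDs and both helper functions are hypothetical, and each entry is assumed to hold a single license string.

```python
# Hypothetical set of accepted identifiers (these are SPDX license IDs).
ALLOWED_LICENSE_IDS = {"CC-BY-4.0", "CC0-1.0", "MIT", "BSD-3-Clause"}


def find_inconsistent_licenses(entries, allowed=ALLOWED_LICENSE_IDS):
    """Return (index, name, value) for every entry whose license is not allowed."""
    problems = []
    for index, entry in enumerate(entries):
        value = entry.get("license")  # None if the field is missing
        if value not in allowed:
            problems.append((index, entry.get("name", "<unnamed>"), value))
    return problems


def exit_code(problems):
    """Non-zero exit code if any problem was found, as a CI job would expect."""
    return 1 if problems else 0


# Hypothetical entries; the real records may be structured differently.
entries = [
    {"name": "A", "license": "CC-BY-4.0"},
    {"name": "B", "license": "CC BY 4.0"},
    {"name": "C"},
]

problems = find_inconsistent_licenses(entries)
for index, name, value in problems:
    print(f"entry {index} ({name}): unexpected license {value!r}")

status = exit_code(problems)
print("exit status:", status)
```

When run as a script in a workflow, passing `status` to `sys.exit` would make the job fail whenever an entry uses a non-canonical license. The same functions could also be called from a notebook for an occasional manual check.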