benetech / Imageshare

MIT License

Master CSV File to convert into JSON #76

Closed clapierre closed 4 years ago

clapierre commented 4 years ago

I am currently creating a master CSV file that will be used to generate the master JSON manifest (#71). I will attach it to this ticket once complete.

jkva commented 4 years ago

@clapierre is it valid for a file to not have a license?

clapierre commented 4 years ago

No, we will have to have a license for each file, but if no license is provided in the JSON file we should assume the most basic license, maybe just GNU? Ultimately we will need to get the licenses from the 3rd party source of the files. I'm just wondering whether leaving the license out could serve as a flag for us to find the real license before making a resource file public.

This gets me wondering whether we need a boolean at either the Resource or Resource File level to indicate whether this resource/file is publicly available.

jkva commented 4 years ago

@clapierre I've noticed that in the Master file, language has a shortcode like "en" as its value. In the original MVP doc the taxonomy for language was a list of full names: "Braille", "German", etc. Should I be mapping these shortcodes, or can the full names be explicit in the generated JSON?

jkva commented 4 years ago

> No we will have to have a license for each file, but if no license is provided in the JSON file we should assume the most basic license maybe just GNU? Ultimately we will need to get the licenses from the 3rd party source of the files. Just wondering if maybe we don't include the license and assume GNU which would flag us to find the real license prior to making a resource file public.

I can default to a license, sure. By default a resource file is published as draft anyway.
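The default-license idea discussed here could look roughly like the following. This is a minimal sketch, not the plugin's behaviour: the default value `GNU-GPL`, the `license` key, and the `license_needs_review` flag are all assumptions made for illustration.

```python
# Hypothetical default; the thread only suggests "maybe just GNU".
DEFAULT_LICENSE = "GNU-GPL"


def apply_license_default(resource_file: dict) -> dict:
    """Return a copy with a license filled in and a review flag set."""
    result = dict(resource_file)
    license_value = (result.get("license") or "").strip()
    if license_value:
        result["license"] = license_value
        result["license_needs_review"] = False
    else:
        # No license supplied: fall back to the default and flag the file
        # so the real license can be found before it is made public.
        result["license"] = DEFAULT_LICENSE
        result["license_needs_review"] = True
    return result
```

Keeping the flag separate from the license value means a defaulted file can still be told apart from one whose source really is GPL-licensed.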

clapierre commented 4 years ago

Right, Sina pointed out that Braille isn't a language but a script, and it would still be English. If we go with the normal language codes then we can display them any way we want in the UI. So this is why I thought of "en", "fr", "de", "es", etc.

jkva commented 4 years ago

> Right Sina pointed out that Braille isn't a language but a script and it would still be English, and if we go with the normal language codes then we can display them any way we want in the UI. SO this is why I thought either "en", "fr", "de", "es" etc.

Ok, in that case it's probably better if I change the taxonomy on my end.

clapierre commented 4 years ago

@jkva do you have a list of the most common licenses as a start? We will add other licenses which may be proprietary in the case where we point to a 3rd party library as will be the case with the "DCMP Membership" license that we will have for all those videos.

jkva commented 4 years ago

That still needs us to account for the "All languages" case, though.

clapierre commented 4 years ago

> That still needs us to account for the "All languages" case, though.

Yes

clapierre commented 4 years ago

For the Language codes we will use ISO 639-1 Language Codes

jkva commented 4 years ago

@clapierre and "All languages" being "all", then?

clapierre commented 4 years ago

For the Licenses this is what we started with but I know there must be a more complete list somewhere.

o CC BY 4.0
o CC:BY
o CC:BY-NC
o CC: BY-NC-ND
o CC: BY-NC-SA
o CC: BY-ND
o CC: BY-SA
o DCMP Membership
o GNU-GPL
o OER

Others we need to include? maybe CC BY 3.0, etc.

clapierre commented 4 years ago

> @clapierre and "All languages" being "all", then?

Sure I think that will be fine.
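As an illustration of why short codes leave the display wording up to the UI, here is a small sketch of a code-to-label lookup. The table is a hand-picked subset and the function name is hypothetical. Note that "all" is the project convention agreed in this thread, not an ISO 639-1 code.

```python
# A few ISO 639-1 codes plus the project-specific "all" value.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "all": "All languages",  # convention from this thread, not ISO 639-1
}


def language_label(code: str) -> str:
    """Map a language code from the manifest to a display name."""
    key = code.strip().lower()
    if key not in LANGUAGE_NAMES:
        raise ValueError(f"Unknown language code: {code!r}")
    return LANGUAGE_NAMES[key]
```

Raising on unknown codes, rather than echoing them back, surfaces typos in the spreadsheet early.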

jkva commented 4 years ago

It would be nice to map Accessibility Accommodations to an array as well. Otherwise I'll map on the plugin side.

clapierre commented 4 years ago

I think that makes sense. Tags and Accessibility Accommodations are arrays in the JSON manifest. I assume you agree, Sina.

sinabahram commented 4 years ago

Yes to tags and accommodations being arrays. Re: licenses, Creative Commons (CC) seems like something worthwhile to track, e.g. CC0 etc.
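A rough sketch of how multi-valued cells might be turned into arrays during the conversion step. The column names (`title`, `tags`, `accommodations`), the tab-delimited layout, and the `;` separator inside a cell are assumptions, since the thread does not specify them.

```python
import csv
import io
import json


def split_cell(cell: str, delimiter: str = ";") -> list[str]:
    """Split one spreadsheet cell into a list of trimmed, non-empty values."""
    return [part.strip() for part in cell.split(delimiter) if part.strip()]


def rows_to_manifest(tsv_text: str) -> list[dict]:
    """Convert tab-delimited text with a header row into manifest entries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    manifest = []
    for row in reader:
        manifest.append({
            "title": row["title"].strip(),
            "tags": split_cell(row["tags"]),
            "accommodations": split_cell(row["accommodations"]),
        })
    return manifest


sample = (
    "title\ttags\taccommodations\n"
    "Cell diagram\tbiology; cells\tAlt text; Tactile\n"
)
print(json.dumps(rows_to_manifest(sample), indent=2))
```

An empty cell becomes an empty array rather than `[""]`, which keeps the output friendly to a strict schema.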

theladymay commented 4 years ago

Just a heads up, I’m the Program Manager for Imageshare, not the Product Manager. Charles is the Product Manager and he can keep that job. Amaya <cleaned up :)>

sinabahram commented 4 years ago

Sounds good 😊.

Quick thing, @TheLadyMay, if you’re going to respond from email, which I tend to do as well, please erase everything in the email and then type your response, because all of it gets shoved into the web interface if you don’t.

clapierre commented 4 years ago

Good point, @sinabahram. I have at times forgotten to do that, and yeah, it definitely makes for a messy GitHub issue thread.

clapierre commented 4 years ago

I have just checked in an updated version of the excel/txt and json file including a new entry for a DCMP video with two languages.

In the process I ran into some invalid characters that we will need to figure out. Special start and end quotes and the en-dash seem to cause rev. ?'s to appear in the txt file, and the JSON file gets \udc97, \udc93 and \udc94. I'm not sure if there will be an easy way to go through the Excel spreadsheets to convert all of these, strip spaces, etc.

sinabahram commented 4 years ago

This is a good catch, Charles, and quite important. Let’s please avoid any Unicode in the spreadsheet.

Those quotes aren’t actually quotes but high Unicode characters, so they should be replaced with regular quotes, same for all other things. It sounds like it was copied from some other source with all that going on. You may wish to first paste it into a regular text editor and then copy paste it back.
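The clean-up suggested here could also be scripted. Below is a minimal sketch that swaps curly quotes, dashes, and no-break spaces for ASCII equivalents and reports anything non-ASCII that remains. The replacement table and function names are assumptions, not part of the project's scripts. As a side note, the escapes \udc93, \udc94 and \udc97 mentioned earlier look like what Python emits when the bytes 0x93, 0x94 and 0x97 (curly double quotes and a dash in Windows-1252) are decoded with the `surrogateescape` error handler. That is an inference the thread does not confirm.

```python
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # no-break space
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_plain_ascii(text: str) -> str:
    """Replace common typographic characters and trim surrounding spaces."""
    return text.translate(_TABLE).strip()


def non_ascii_chars(text: str) -> list[str]:
    """List the distinct non-ASCII characters still present, for review."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Running `non_ascii_chars` over each cell after `to_plain_ascii` gives a short list of characters the table does not yet cover.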

jkva commented 4 years ago

Currently the input files are being validated against https://github.com/benetech/Imageshare/blob/Development/wp-content/plugins/imageshare/assets/import.schema.json . That's pretty strict; but I'd prefer not to do any data marshalling plugin-side if I can avoid it.

sinabahram commented 4 years ago

Sorry, @jkva is there an ask here? e.g. is it not validating currently? Or, are you just letting us know?

jkva commented 4 years ago

@sinabahram Sorry - it's meant to be informative as I added it today.

clapierre commented 4 years ago

Hi @jkva, I see the following for subjects in the validator: "subject": { "enum": ["Biology", "Chemistry", "Physics", "Environment", "Earth", "Astronomy", "Algebra 1", "Algebra 2", "Calculus", "Statistics", "Engineering", "Circuits", "Computer Programming"] },

So if there is a subject not in there, what's the process for adding new subjects?

Same question for License and Accommodations.

jkva commented 4 years ago

I should add that the current resource file does not validate against the schema, not without me making some modifications here and there. To me that suggests that the existing taxonomies are not entirely fleshed out yet, or not properly synchronised.

jkva commented 4 years ago

@clapierre I've currently got them hardcoded to illustrate how it would work. Eventually I would like the schema to be partially dynamically generated out of the taxonomy scheme that generates the internal WordPress taxonomies.

That would then ensure that the existing WordPress GUI can be used to add subjects, licenses, accommodations, et cetera.
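To make the idea of a generated schema concrete, here is a small sketch that builds the `enum` lists from taxonomy terms and checks a record against them. The plugin itself is WordPress-based and this Python version is only illustrative. The function names and sample terms are assumptions, and the check covers `enum` membership only, not full JSON Schema validation.

```python
import json


def build_schema(taxonomies: dict[str, list[str]]) -> dict:
    """Build a schema fragment with one enum per taxonomy."""
    properties = {
        # dict.fromkeys drops duplicate terms while preserving their order.
        name: {"enum": list(dict.fromkeys(terms))}
        for name, terms in taxonomies.items()
    }
    return {"type": "object", "properties": properties}


def invalid_fields(record: dict, schema: dict) -> list[str]:
    """Return the names of fields whose value is not in the allowed enum."""
    bad = []
    for name, rule in schema["properties"].items():
        if name in record and record[name] not in rule["enum"]:
            bad.append(name)
    return bad


schema = build_schema({
    "subject": ["Biology", "Chemistry", "Physics"],
    "license": ["CC BY 4.0", "GNU-GPL"],
})
print(json.dumps(schema, indent=2))
```

With this shape, adding a term to a taxonomy changes the generated schema without anyone editing the schema file by hand.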

jkva commented 4 years ago

@clapierre I'll likely convert it into a twig template as I'm already using twig, and it would become output of some ImportValidatorHelper or some such.

jkva commented 4 years ago

@clapierre To clarify, it would be better to have the languages be fully-formed ISO 639-1 names, "English", "French", etc, if possible - this means I don't have to do any mapping on the plugin side and will mean that new languages could be added via the WordPress admin interface.

If not, I can still make it work, but it's not ideal.

sinabahram commented 4 years ago

You prefer “English” over “en-us”?

clapierre commented 4 years ago

I thought we would use the ISO 639-1 language codes, which have "en" as generic "English", and there is also the option, as Sina points out, to have "en-us" or "en-uk" etc. But the current txt file does have the ISO 639-1 code "en", which I would think is what we want, right?

sinabahram commented 4 years ago

Is this issue ready to be closed out?

clapierre commented 4 years ago

I think so. I believe you may have wanted to remove my script that does the conversion from UTF-16 to UTF-8 and have that done in your Python script? We can leave this as is if you would rather, @sinabahram.

sinabahram commented 4 years ago

I think it's fine for now. That's a pretty small optimization if anything, so let's see if any problems arise with the current workflow.
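For reference, folding the UTF-16 to UTF-8 step into a Python script would take only a few lines. This is a sketch under the assumption that the exported text file starts with a byte order mark, which the `utf-16` codec uses to detect endianness. The function names are hypothetical and not taken from either script mentioned above.

```python
from pathlib import Path


def utf16_bytes_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 bytes (BOM-aware) and re-encode them as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Read a UTF-16 file and write its contents to dst as UTF-8."""
    dst.write_bytes(utf16_bytes_to_utf8(src.read_bytes()))
```

Working on raw bytes leaves line endings untouched, so the converted file differs from the original only in its encoding.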