benetech / Imageshare

MIT License

Master CSV File to convert into JSON #76

Closed clapierre closed 4 years ago

clapierre commented 4 years ago

I am currently creating a master CSV file that will be used to generate the master JSON Manifest (#71). I will attach it to this ticket once complete.

sinabahram commented 4 years ago

Hi, I would advise not doing this so that we can first agree to a schema before generating anything.

clapierre commented 4 years ago

@sinabahram Sorry, that was meant to say master "CSV", not "JSON". I updated the description.

clapierre commented 4 years ago

Imageshare-Master.xlsx Imageshare MVP Database Taxonomy.docx

I will create the CSV file once we all agree to the format of this Excel spreadsheet, which has 11 resources and close to 60 individual files as a starting point for the Imageshare database import.

I have included examples from Open Stax, Library Lyna, APH, Central Access, Thingiverse, and DCMP. The resources cover images, video, tactile, and 3D models, with accompanying image descriptions, simplified and extended image descriptions, and production notes for tactile and 3D models.

I hope this is a comprehensive list of what we would expect to import from various libraries; either way, it should be enough to get started.

Happy to tweak this model especially on the various accommodations.

sinabahram commented 4 years ago

Just confirming that this is only to test the format, because there are duplicate values for image alt and resource description, missing featured alts, etc.

sinabahram commented 4 years ago

This is a good start. Can you please move tags to resource_tags, not at the file level, and then can you fill out the first four or five rows only? Right now there are tons of inconsistencies there so it's hard to get a complete picture. Can we also rename subjects to subject, singular, unless you are intending for that to be multiple subjects? If so, let's do an example of one with multiple subjects to see that in action. We'll need to use a tab-separated file since fields that have multiple values need to be separated with a comma, which then precludes us from using CSV, which is fine.

sinabahram commented 4 years ago

One additional thing: can we just have a list of fields for Collection, Resource, and File? Over 90% of the info in that Word document does not apply to this system whatsoever. We just need three issues on this repository for Collection Fields, Resource Fields, and File Fields. Alternatively, I think we have an issue for the data model, so I'll go update that appropriate place or make us an issue to finalize this. We should not be discussing sizes, character counts, encoding type, or any of that ... just a list of fields for now.

clapierre commented 4 years ago

Yes, I can move the tags to the resource level. I forgot to remove all the duplicate resource image and alt text descriptions, which I have now cleaned up. For the resources which are missing an image and its alt text, that's because I either don't have an image or its alt text, or I have the image but no alt text yet, so we need to make sure we can still import those and fix them afterwards.

Yes, I can change to subject, singular. I'm not sure if we will ever have a resource in multiple subjects, but I wasn't sure if I should exclude that from happening.

We can discuss the inconsistencies during our scrum, and address the Word doc then too. I will clean up these documents as best I can, then we can revisit what's left to do.

clapierre commented 4 years ago

Imageshare-Master.xlsx Imageshare Resources-Files-Collections.docx

Here are these two files, cleaned up based on Sina's feedback.

clapierre commented 4 years ago

Ok after the SCRUM meeting with @sinabahram here are the updated Excel spreadsheet and word document, along with a Tab Separated Value (.txt) list since we have commas contained within and we won't be including Tabs in the actual data. Imageshare-Master.txt Imageshare-Master.xlsx Imageshare Resources-Files-Collections.docx

Happy to tweak this further if there are any issues found.

sinabahram commented 4 years ago

Ok, I’ve got this structure converting to json with a Python script. @clapierre can you please make sure to trim some white space out of these cells as you go? I'll also do it on the script side if I find a moment, but try to make sure trailing and leading spaces are not present if possible. Also, I think we shouldn't have newlines in descriptions. We can format once on the website if ever necessary, but let's have strings without newlines for now is my suggestion.
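For readers following along, the cleanup requested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script referred to in this thread: the column names in the sample (`resource_title`, `resource_description`) are placeholders, and the helper names `clean_cell` and `tsv_to_records` are made up for this example.

```python
import csv
import io
import json


def clean_cell(value):
    """Collapse newlines and whitespace runs into single spaces and trim both ends."""
    # str.split() with no arguments splits on any whitespace run
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return " ".join(value.split())


def tsv_to_records(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    records = []
    for row in reader:
        records.append(
            {
                clean_cell(key): clean_cell(value or "")
                for key, value in row.items()
                if key is not None  # ignore cells beyond the header width
            }
        )
    return records


if __name__ == "__main__":
    # Hypothetical sample: a header with a trailing space, a cell with a
    # leading space, and a quoted cell containing a newline.
    sample = (
        "resource_title \tresource_description\n"
        " Carbon Cycle\t\"A diagram of the\ncarbon cycle. \"\n"
    )
    print(json.dumps(tsv_to_records(sample), indent=2))
```

Cleaning the header cells as well as the data cells means a stray trailing space in a column name does not end up as part of a JSON key.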

@jkva I have attached a sample json file for you to tell me about anything that is wrong, because otherwise, once @clapierre does another round of clean up on this file, you can use the resulting json as our sample input for continuing the build out of the models. I'm happy to clean up the column names, Job, to match exactly the WP field names.

See this json file for what I'm talking about.

clapierre commented 4 years ago

@sinabahram I think you used an old version of the txt file because under file_accomodations I no longer have "Image Description" but "Visual Description" instead.

Also, when I open it up for example I only see 1 file for the first resource "Carbon Cycle" but as we discussed in our call I needed to keep resources on 1 row and files on the other rows.
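The "resource on one row, files on the following rows" layout mentioned above could be grouped roughly as follows. This is a sketch under assumptions: the actual spreadsheet layout is not shown in this thread, so the rule "a non-empty resource title starts a new resource" and the column names `resource_title`, `file_title`, and `file_uri` are hypothetical.

```python
def group_files_under_resources(rows, resource_key="resource_title"):
    """Attach file rows to the most recent resource row.

    Assumed layout: a row with a non-empty `resource_key` starts a new
    resource; rows where it is empty describe files of that resource.
    """
    resources = []
    for row in rows:
        # Keep only non-empty cells, split by their column prefix.
        resource_fields = {k: v for k, v in row.items() if k.startswith("resource_") and v}
        file_fields = {k: v for k, v in row.items() if k.startswith("file_") and v}
        if row.get(resource_key):
            resources.append({**resource_fields, "files": []})
        elif not resources:
            raise ValueError("file row appears before any resource row")
        if file_fields:
            resources[-1]["files"].append(file_fields)
    return resources


if __name__ == "__main__":
    rows = [
        {"resource_title": "Carbon Cycle", "file_title": "", "file_uri": ""},
        {"resource_title": "", "file_title": "Diagram", "file_uri": "https://example.org/a.png"},
        {"resource_title": "", "file_title": "Description", "file_uri": "https://example.org/a.txt"},
    ]
    for resource in group_files_under_resources(rows):
        print(resource["resource_title"], len(resource["files"]))
```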

I have cleaned up the leading/trailing spaces and newlines, and also fixed a couple missing alt text descriptions I forgot to add yesterday.

Here are the two updated files: the Excel spreadsheet and the tab-separated text file. Imageshare-Master.txt Imageshare-Master.xlsx

clapierre commented 4 years ago

@sinabahram also note we decided to change a couple of the header column names so that will need to be tweaked as well, and while removing spaces I found a trailing space in the header row column which also has been fixed.

sinabahram commented 4 years ago

Sure thing. See below for updated json based on that text file. Column names don't matter, just the split point. The script uses whatever column names are there. Once we make sure this is sufficient for importing (which relies on @jkva signing off), then we can change the column names before conversion to make things easier on the WP side of things. Job, wp-cli may also be helpful to us here, possibly allowing us to avoid any import whatsoever at the WP layer given how truly painfully slow WPEngine is.

Imageshare-Master.json.txt

clapierre commented 4 years ago

@sinabahram Yes, much better, but looking it over I noticed a couple of other small errors, and that I was missing an entire resource whose description and source got removed by accident. That got me thinking: why am I doing this in the issue?

So I have created in the development branch a "resource-scripts" folder and have my excel xlsx file and tab separated txt file.

Sina, if you can put your script and its resulting JSON file in that folder, then at least we get history, which would have helped me realize I had removed some data by accident when checking in and reviewing the diffs.

sinabahram commented 4 years ago

I don’t think it’s a bad idea for us to track this in it, but won’t that folder get pushed to WP?

Are the deployment scripts only pushing wp-content?

If so, then yes, I’ll do this, but otherwise, let’s figure that out first so we’re not pushing those files and code to deployment. Also, the deployment scripts need to get updated if we’re going to do this because each push will result in a deploy to WP, right?

clapierre commented 4 years ago

Right. Let's see what @johnhbenetech says regarding the pushing from GitHub to WP. I was thinking only the content in the WP folder would get pushed, but it is good to verify that so that we can keep documentation and other scripts and files which should remain only in GitHub.

johnhbenetech commented 4 years ago

@sinabahram @clapierre The WP engine deployment is a git based deployment. Anything that isn't in the gitignore file will get deployed.

So for example after Charles's latest commit this is now live: http://imgsdev.wpengine.com/resource-scripts/Imageshare-Master.txt

So anything you don't want to be deployed can just be handled through the gitignore

clapierre commented 4 years ago

But doesn't adding those files or directories to the gitignore also mean that if I make changes to those files, they won't get pushed to GitHub either?

sinabahram commented 4 years ago

That is what I suspected regarding it being deployed, so let’s remove that. However, I completely don’t understand about putting this in the ignore file. Won’t it then not get tracked which is not what we want, right? What am I missing?

clapierre commented 4 years ago

That's what I was saying. OK, I will remove it from GitHub and put the updated files in here, but there must be a way to do this so that it gets updated on GitHub only and is not pushed to WP.

johnhbenetech commented 4 years ago

@sinabahram @clapierre you're both right, I think I misunderstood the purpose of the files in question.

What do we think about putting these types of resources in a different repo?

clapierre commented 4 years ago

Yeah I guess, or maybe a different branch @sinabahram? A documentation Branch? Hate to have multiple repos for this.

johnhbenetech commented 4 years ago

In other projects we have separate repos for reference data and other stuff like that. You could even make it a private repo.

In other places we use the github wiki to store documentation, but the history tracking isn't as straightforward

clapierre commented 4 years ago

Right @johnhbenetech, I want the history tracking, I also like to have everything under one Repo, we could do a new private repo for that, but I am thinking this could get done by having a Documentation Branch which will never get pushed to WP and can contain all our documents so everything is in one GitHub repo and is versioned. Anything wrong with that @sinabahram? Or is it just easier to make an Imageshare-docs GitHub new repo for all this.

sinabahram commented 4 years ago

I’m not a fan of using branches for this. That’s not what branches are for IMHO. Branches are for different versions of things that will eventually get merged into the eventual code base.

I think that this deserves another repository because we also don’t want issues and tickets around data flow, data practices, other additional scripts in the future, etc. to disrupt the WP side of things.

So, an Imageshare-utils repository seems appropriate. Nothing to do with documentation, though. That should definitely be tracked in the original repository because all documentation should presumably live in WP.

sinabahram commented 4 years ago

IMHO, it’s generally considered bad practice to do this, Charles. A documentation branch only makes sense if you’re off working on it on the side and then plan on merging it back in, e.g. the documentation system for the site or something related, not something that will never get merged whatsoever. That’s just two completely separate file trees, which deserve their own repository.

Also, you keep saying documentation, but I am assuming you mean utilities, and I wish to stress that documentation does belong in this repository, or rather in WP as actual content.

Nothing on this thread has anything to do with documentation, or so I claim, only utilities, data processing, etc., so I just want us to all be on the same page about that.

Does that all make sense?

clapierre commented 4 years ago

@sinabahram Yes, makes sense, and agreed. I see what you are saying and that does make sense. OK, I will make a new repo, Imageshare-utils, add everyone to it, and then push my two files.

sinabahram commented 4 years ago

Right on. Once you do that, I can then push the script. I documented it a bit more than usual for such a script since I knew someone else would be maintaining after we’re off the project.

clapierre commented 4 years ago

Thanks @sinabahram, the new repo is now created and you, @johnhbenetech and @jkva have been granted access. Let me know if anyone else needs access as well.

I pushed my latest code there as well so you can do one last conversion to JSON and include your script as well.

sinabahram commented 4 years ago

Awesome. I did a bit of re-org just to make things easier on the command line and for new-comers to the repository. I added a Data Files directory as well. The script is in there. I also modified the script to strip the “resource” and “file” prefixes for cleaner JSON output.
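The prefix stripping described above might look something like this. It is a sketch, not the script in the Imageshare-utils repository: the helper names are invented here, and it assumes the prefixes are joined to the field name with an underscore, as in `resource_tags`.

```python
def strip_prefix(key, prefixes=("resource_", "file_")):
    """Return `key` without a leading 'resource_' or 'file_' prefix, if present."""
    for prefix in prefixes:
        if key.startswith(prefix):
            return key[len(prefix):]
    return key


def strip_prefixes(record):
    """Apply strip_prefix to every key of one resource or file record."""
    return {strip_prefix(key): value for key, value in record.items()}


if __name__ == "__main__":
    print(strip_prefixes({"resource_tags": "biology,cells", "subject": "Biology"}))
```

This is only safe when resource fields and file fields live in separate objects. Applied to a single flat row, two columns that differ only by prefix would collapse into the same key.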

I suspect it’s cleaner to have the tags in an array, but I need @jkva to comment on that. Job, do you prefer “tag a, tag b, tag c” or [‘tag a’, ‘tag b’, ‘tag c’] for import purposes?

Charles, in the meantime, can you strip the white space? For example, “tag a, tag b” should be “tag a,tag b”. It’s fine for spaces to exist in tags I believe, but what I’m wanting to avoid is needing to deal with leading spaces on import. A global find and replace on “, “ to “,” should fix that right up.
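If the script ends up emitting arrays, both concerns (stray spaces after commas, and string versus array) can be handled in one place. A minimal sketch, with `split_multi_value` as a made-up helper name:

```python
import json


def split_multi_value(cell):
    """Split a comma-separated cell into a list, trimming each item.

    Spaces inside an item are kept; empty items are dropped.
    """
    return [item.strip() for item in cell.split(",") if item.strip()]


if __name__ == "__main__":
    tags = split_multi_value("tag a, tag b,tag c")
    print(json.dumps({"tags": tags}))
```

With this in place, a missed ", " in the spreadsheet no longer produces a tag with a leading space, although cleaning the source data is still worthwhile.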

clapierre commented 4 years ago

@sinabahram OK, I removed the space after commas in tags and in accommodations.

I also made all the data in tags & accommodations lowercase to help with consistency.

For Subject Area should those also be lowercase? Currently the initial letter is uppercase.

sinabahram commented 4 years ago

Charles, the initial caps should be fine.

jkva commented 4 years ago

This looks good. A few things:

jkva commented 4 years ago

One more thing, the contributor resource field doesn't seem to exist here.

clapierre commented 4 years ago

Hi @jkva after speaking with our team and @sinabahram we have decided the following regarding your points.

I recall the early data model referring to grade(s) on the resource file level. Is this no longer required?

We thought grade level may be too restrictive. First off, if we say something is 6th grade, we may not know what grade it really should be, and second, a person in 9th grade may need that resource since they are reading at a 6th grade level. So for now we are not going to include that.

Should length be in minutes? Not seconds?

Sina and I talked about that and it could be, but I don't think we will get a lot of videos that would require that level of precision. Knowing it is 10 minutes and 30 seconds is not crucial; just knowing that it is ~10 minutes long is more important, for example to know if this is something you can show in the last 10 minutes of class.

Regarding tags, an array would be preferred since that's what I'd be mapping to internally anyway.

@sinabahram can you adjust the script to make this an array?

Similarly, according to the early data model, subject was an array. Is this no longer the case? If still an array, it would be nice if it was in the json file as well, e.g. "subject": ["physics"],

Right, Sina questioned if we would ever have multiple subjects for a resource. We may have a sub domain under a particular Subject area, but we would just use that Sub Domain as the Subject and the umbrella subject area would be implied, i.e. "Science"/"Biology" (we would use Biology) or "Math"/"Geometry" (we would use Geometry).

tags and accommodations being lowercase is good. The less massaging of the data I have to do, the better.

I figured.

regarding file URIs, there are clear differences between relative bucket paths and remote uri paths. I'm fine with this, it's easily accounted for via some utility functions, but I want to make sure this distinction between local-relative and remote-absolute is necessary in the first place.

Right, the S3 buckets have the URIs with spaces replaced with + signs, whereas actual URIs to 3rd party resources may have actual spaces. Here are the two examples we currently have (no spaces for the 3rd party yet, but there could be, or maybe they would be escaped as %20 in the URI): https://imagesharemvp.s3.amazonaws.com/ImageshareResourceFiles/Central+Access/Summer+2018/Frog+dissection/Frog+dissection.zip vs. https://dcmp.org/media/3559?ref=imgshr

clapierre commented 4 years ago

Right as for

One more thing, the contributor resource field doesn't seem to exist here.

After speaking with Sina we felt the Source field would suffice for now. I think after the MVP and this contract if we add the new feature of user accounts and others to upload content we can add in that functionality then.

sinabahram commented 4 years ago

Yes, I’ll handle the array thing.

Charles, it is important that URIs do not have spaces. It is not a URI if it has spaces. It’s just an invalid string, full stop. URIs absolutely may not contain any non-URI characters. This needs to be strictly enforced.
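One way to enforce this before import is a small check run over every URI cell. The sketch below is an assumption about what "strictly enforced" could mean in practice: it accepts only the characters RFC 3986 allows in a URI, and additionally requires an absolute http or https URI with a host, which also rejects relative paths. The function name `is_acceptable_uri` is made up for this example.

```python
import re
from urllib.parse import urlsplit

# Characters permitted by RFC 3986: unreserved, reserved, and "%" for
# percent-encoding. Spaces and other characters are rejected.
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def is_acceptable_uri(value):
    """Return True for an absolute http(s) URI made only of URI characters."""
    if not _URI_CHARS.fullmatch(value):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:  # e.g. malformed bracketed host
        return False
    # Assumption: only absolute http/https URIs are wanted in the data.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in (
        "https://dcmp.org/media/3559?ref=imgshr",
        "https://example.org/Frog dissection.zip",
        "Central+Access/Frog+dissection.zip",
    ):
        print(candidate, is_acceptable_uri(candidate))
```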

Job, can you verify that if we set up subjects in WP like so:

Science:

           biology

           geology

           physics

and then we just put biology, that science is implied e.g. a hierarchy? In short, I’m asking if taxonomies can have a hierarchy and you can look it up by child node?

sinabahram commented 4 years ago

Also, @jkva there shall be zero relative paths anywhere, full stop. Those are only example values @jscholes put into the file until he had S3 strings to put in.

clapierre commented 4 years ago

The local relative paths have been removed, so there are no longer any spaces: the files are now in the S3 bucket and, like I said, +'s replaced the spaces in the folder structure. I also checked, and the DCMP video resource paths and featured image URIs also do not contain spaces, so I think we are in good shape.

jkva commented 4 years ago

Job, can you verify that if we set up subjects in WP like so: Science: biology geology physics and then we just put biology, that science is implied e.g. a hierarchy? In short, I’m asking if taxonomies can have a hierarchy and you can look it up by child node?

@sinabahram Yes. The current subject taxonomy is hierarchical. I can look a term up by name and find its parent term (if any) and walk the tree recursively if necessary.
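To make the "look up by child node" idea concrete, here is a language-neutral illustration in Python. It is not WordPress code and does not reflect how the taxonomy is stored there; the `PARENT` map simply mirrors the example hierarchy from this thread, and `ancestors` is a hypothetical helper.

```python
# Hypothetical parent map mirroring the example hierarchy in this thread.
PARENT = {
    "science": None,
    "biology": "science",
    "geology": "science",
    "physics": "science",
}


def ancestors(term, parent_of=PARENT):
    """Return the chain of parent terms for `term`, nearest parent first.

    Raises KeyError if the term is not in the taxonomy.
    """
    chain = []
    current = parent_of[term.lower()]
    while current is not None:
        chain.append(current)
        current = parent_of[current]
    return chain


if __name__ == "__main__":
    print(ancestors("biology"))
```

Storing only "biology" on a resource is therefore enough: the umbrella subject can be recovered by walking up the tree.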

jkva commented 4 years ago

@clapierre thanks for your detailed answers. I'll reflect them on my side.

sinabahram commented 4 years ago

@jkva, that’s great. Thank you!

clapierre commented 4 years ago

Hey @sinabahram, so I know now why we thought we needed multiple subjects. In the DCMP videos they have the same resource repeated 3 times, and the only change between all 3 of them was the subject area. Here are a couple of examples:

Electricity and Magnetism,Energy,Physics
Oceanography,Marine Life,Environmental Issues
Animals,Science Experiments,Science Methods
Environmental Issues,Conservation
Conservation,Plants,Environmental Issues

So how should we handle this? For some things like "Science Methods, or Science Experiments" we can put those under tags, but what about the others?

sinabahram commented 4 years ago

Charles, can you please give a single example of two things that are identical but have different tags. I don’t see that reflected in the data you pasted

clapierre commented 4 years ago

Sure, I was looking over all the original data, and then Amaya and I noticed a lot of duplicates, and we tracked them down to having this issue with the subject:

Current Electricity
Topic: Business, Science
Subtopic: Electricity and Magnetism, Energy, Physics

sinabahram commented 4 years ago

Charles, there’s no “duplicate” in your example. Maybe I’m missing something?

You said that two things may only differ by their subjects, but that’s not what you have. You only have one example with a topic and subtopic. What am I missing?

Let’s break it down by line

Current Electricity https://dcmp.org/media/5904?ref=imgshr

That is terribly named, but not much we can do about that I suppose. Electric Current would make a ton more sense. So, that’s the name of the resource.

Then we have

Topic: Business, Science

This makes little sense to me. Business for electric currents? That seems completely random and unrelated. Science is obviously correct.

Next line

Subtopic: Electricity and Magnetism, Energy, Physics

Ok, that kind of tracks, but aren’t those just tags? Are you really going to build out an entire data model for all topics and subtopics covered in American education? That will take months.

There’s nothing preventing you from having a hierarchy that goes

Science

           Physics

                          Energy

                                         Electricity and Magnetism

But that seems like a ton of work for very little gain. Are you really going to have multiple resources on topics four levels deep in the tree?

So, I just want to be clear. We may need to change our model a bit, but first, can we get a firm handle on this data? Is DCMP really that specific and hierarchical about all topics?

I just don’t even see how that works. How would they model a concept like work? Work is both mechanical and potential, so you then have to have duplicate trees for similar concepts. How would they model waves? Waves are both mechanical and electromagnetic.

This really feels more appropriate for tags. I would do this as physics under science as the subject, and then electromagnetism and energy can be tags. No earthly idea what business is doing there.

Let me know your thoughts.

clapierre commented 4 years ago

Hi Sina, so in my example in the Excel spreadsheet there are like 6 duplicate entries all pointing to the same video of Current Electricity, but the only difference is that under the subject column one has "Science", then the 2nd row has all the same info except the subject column has "Energy", etc. Obviously it's the same resource with just different subjects that this video covers, which is probably why "Business" is in there too.

We would obviously remove all those duplicate entries in our Excel spreadsheet and combine the subjects, which is why we were thinking we either add the ability to have multiple subjects, or we have 1 subject and move the rest to tags.
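Collapsing rows that differ only in their subject could be done along these lines. This is a sketch under assumptions: it treats the URI as the identity of a resource, and the field names `title`, `uri`, and `subject` are placeholders rather than the real column names.

```python
def merge_duplicate_rows(rows, key="uri", multi="subject"):
    """Merge rows sharing the same `key`, collecting distinct `multi` values.

    The first row seen for each key supplies the other fields; the order
    of first appearance is preserved for both rows and collected values.
    """
    merged = {}
    for row in rows:
        entry = merged.setdefault(row[key], {**row, multi: []})
        if row[multi] and row[multi] not in entry[multi]:
            entry[multi].append(row[multi])
    return list(merged.values())


if __name__ == "__main__":
    uri = "https://dcmp.org/media/5904?ref=imgshr"
    rows = [
        {"title": "Current Electricity", "uri": uri, "subject": "Science"},
        {"title": "Current Electricity", "uri": uri, "subject": "Energy"},
        {"title": "Current Electricity", "uri": uri, "subject": "Science"},
    ]
    print(merge_duplicate_rows(rows))
```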

I don't think we have to have a full hierarchy 4 levels deep; it's just tagging other subject areas which may or may not be related, like business and science. I am just guessing at how they are doing this. All I know is that for this example they have two main "Subject" areas, "Business" and "Science", then the subtopics, which seem to be areas within science: "Energy", "Electricity and Magnetism", "Physics", etc.

So yes, we could just put those under tags and call it a day. The tricky part will be to pick what the main "subject" area is and then put the rest under tags.

So in this particular case I would think the subject area would probably be what, "Energy"? Or "Electricity and Magnetism", and then the rest would just be tags? I don't think I would even include "Business", to be honest.

So let's say we just have a two-level-deep hierarchy for Subject, and anything that would go to a 3rd or 4th level will all just get put at level 2. E.g.: Physics (level 1), Energy (level 2), Electricity and Magnetism (level 2).

Obviously Physics would be under Science, but I don't think we even need to have that, right?

So then couldn't we mark all subjects which make sense for this resource: [Energy],[Electricity and Magnetism]?

So then if someone is searching for the subject field of "Energy" they would find this resource. Likewise if they search for the subject field of "Electricity and Magnetism" they would also find this resource.

sinabahram commented 4 years ago

Well, I do think physics should be under science, absolutely, but other than that, I think you and I are saying the same thing, which is that the rest should be tags e.g. energy, electro-magnetism, etc.
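The "one subject, the rest become tags" rule could be applied mechanically once there is an agreed list of subject terms. The sketch below assumes such a list exists; `KNOWN_SUBJECTS` is hypothetical, and choosing the first matching label as the subject is just one possible tie-break. Tags are lowercased and the subject keeps its initial capital, matching the conventions agreed earlier in this thread.

```python
# Hypothetical set of terms that exist in the subject taxonomy (lowercase).
KNOWN_SUBJECTS = {"biology", "geology", "physics"}


def split_subject_and_tags(labels, known_subjects=KNOWN_SUBJECTS):
    """Pick the first label that is a known subject; the rest become tags.

    Returns (subject, tags). `subject` is None when no label matches.
    """
    subject = None
    tags = []
    for label in labels:
        if subject is None and label.lower() in known_subjects:
            subject = label
        else:
            tags.append(label.lower())
    return subject, tags


if __name__ == "__main__":
    print(split_subject_and_tags(["Electricity and Magnetism", "Energy", "Physics"]))
```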

clapierre commented 4 years ago

Yeah, sounds good. @theladymay (btw, that is Amaya, our Product Manager for Imageshare), let's do this then: have the main subject area be the main area of Science each resource resides under, and any other subtopic areas will be put in the tags field.

Thanks Sina for hashing this out!

sinabahram commented 4 years ago

Sure thing. I’ll make sure that tags turns into an array on next update to script.