LLNL / scraper

Python library for getting metadata from source code hosting tools

MS Excel / CSV to JSON File Converter #33

Closed: RicardoAReyes closed this issue 3 years ago

RicardoAReyes commented 5 years ago

Many of the CFO Act agencies required to comply with the OMB M-16-21 Federal Source Code Policy (FSCP) use MS Excel to collect their software inventory information from many organizations and a multitude of engineering programs. Also, some of the organizations use legacy version control systems (VCS) that prevent them from running the Scraper tool. The process is labor intensive, time consuming, and prone to human error.

For that reason, we would like agencies to use the Scraper tool to systematically convert their MS Excel or CSV data into a JSON file while adhering to the Code.gov Metadata Schema 2.0.0 requirements.

Source files can be made available. However, the main objective would be to standardize the process by creating an MS Excel template with all of the Metadata Schema 2.0.0 fields.

Example Screenshots:

(Two screenshots of the example spreadsheet, dated 2019-05-17, were attached.)

IanLee1521 commented 5 years ago

As mentioned in the call, the most important piece that is needed here will be the mapping of CSV headers to Code.json fields. Once we have that, it shouldn't be hard to add this back in.

This used to be done for a DOE spreadsheet. That code was removed at some point, but it can still be found at https://github.com/LLNL/scraper/blob/75381f2334e55fcfb3e4c66a983d55e214971da0/scraper/code_gov/doe.py for reference and as a starting point for re-adding this functionality.
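
A minimal sketch of what that header-to-field mapping could look like. The CSV column names below are assumptions for illustration (the official template's headers may differ); the target fields are drawn from the Code.gov Metadata Schema 2.0.0:

```python
# Hypothetical sketch: map rows of an agency CSV to code.json "releases"
# entries. The column names here are assumptions, not the official
# template's headers; adjust CSV_TO_CODEGOV to match the real spreadsheet.
import csv
import json

CSV_TO_CODEGOV = {
    "Project Name": "name",
    "Description": "description",
    "Repository URL": "repositoryURL",
    "Organization": "organization",
    "Usage Type": ("permissions", "usageType"),
    "Tags": "tags",  # assumed to be semicolon-separated in the sheet
}

def row_to_release(row):
    """Convert one CSV row (a dict) into a code.json release entry."""
    release = {}
    for header, field in CSV_TO_CODEGOV.items():
        value = (row.get(header) or "").strip()
        if not value:
            continue
        if field == "tags":
            release["tags"] = [tag.strip() for tag in value.split(";")]
        elif isinstance(field, tuple):  # nested field, e.g. permissions.usageType
            parent, child = field
            release.setdefault(parent, {})[child] = value
        else:
            release[field] = value
    return release

with open("inventory.csv", newline="", encoding="utf-8") as fh:
    releases = [row_to_release(row) for row in csv.DictReader(fh)]

code_json = {
    "version": "2.0.0",
    "agency": "EXAMPLE",  # placeholder agency acronym
    "measurementType": {"method": "projects"},
    "releases": releases,
}
print(json.dumps(code_json, indent=2))
```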

RicardoAReyes commented 5 years ago

@IanLee1521 we are working on creating the CSV file with the proper headers. :-)

RicardoAReyes commented 5 years ago

@IanLee1521 @leebrian

Here is the official Code.gov Software Inventory MS Excel template: Code.gov_Metadata_Schema_2.0.0_Template.xlsx

Also, I've created an example file with the USDOT repo data (DOT uses MS Excel to track their software inventory): DOT_Software_Inventory_Example.xlsx

IanLee1521 commented 5 years ago

For anyone else staring at this... the Excel sheet is a bit wider than a MacBook Pro can display, and I had to narrow some columns to get it to scroll.

The formatting is a little odd for the spreadsheet, so I might have to work a bit harder to pull these in, but since it's the "official" spreadsheet, that should be OK.

RockManJoe64 commented 5 years ago

I am unclear on exactly what the requirement here is. The scraper is a set of Python libraries that go out and scrape metadata about public repos. Where does the CSV file come in? Does the scraper use the CSV file to convert to JSON, which is then fed back into the scraper as configuration?

IanLee1521 commented 5 years ago

@RockManJoe64 - My understanding of the request from GSA is that CSV is meant to be an alternate input format, e.g. instead of scraping from GitHub's API, the "scraping" source is a CSV file provided by agencies that still do data calls this way.

StephenQuirolgico commented 4 years ago

@IanLee1521 Having a CSV file that folks could use in lieu of creating a JSON file (or running the scraper) would be immensely useful at DHS. It would also be easier for us to merge metadata from multiple inventories into a single CSV file than to do so via JSON.

StephenQuirolgico commented 4 years ago

@RicardoAReyes Is it safe to start using the Code.gov_Metadata_Schema_2.0.0_Template.xlsx file at DHS? Folks are asking for an easier alternative to json/scraper for inventorying code (i.e., a spreadsheet).

IanLee1521 commented 4 years ago

@StephenQuirolgico -- Ricardo has moved on to other endeavors. @jcastle or @saracope are probably the better folks to ask.

I'm happy to accept a pull request adding this support, but I think it would really be best for that to come from those who make / use the spreadsheet. (To your other question: I don't know whether that template is "official" or not.)

leebrian commented 4 years ago

I think a manual spreadsheet is really not feasible in the long run because of the labor and quality issues that come with the information. Since there are hundreds, if not thousands, of items, it would be very expensive to keep items up to date manually (last modified, tags, etc.), so an Excel option should be considered temporary at best while moving toward an automated method. From experience, manually populated Excel spreadsheets are difficult to process because manual data entry is rarely consistent or reliable, so this will likely require multiple additional adaptations.

I agree with Ian that this functionality could be good to add, but I'm having a hard time understanding how much effort should be put into accommodating a manual workflow in automated tools.

StephenQuirolgico commented 4 years ago

@leebrian @IanLee1521 - Agree on all points. Just frustrating trying to get (many) departments to run the scraper (or manually develop the json file).

JosephAllen commented 4 years ago

Would it be acceptable to use VBA inside of Excel to generate the JSON from the data in an open .xlsx or .csv?

leebrian commented 4 years ago

I think it is acceptable, if someone wants to create and maintain a VBA script. It may be complicated to maintain, as there are numerous security hoops to jump through (script signing).

It will also require quite a bit of user support to walk someone through installing and running the VBA script. This seems out of scope for scraper (since it's a Python project, not VBA or Excel support), but it might be of interest to Code.gov if they are willing to host the VBA utility (assuming someone writes it).

tinjaw commented 4 years ago

> I think a manual spreadsheet is really not feasible in the long run because of the labor and quality issues that come with the information. Since there are hundreds, if not thousands, of items, it would be very expensive to keep items up to date manually (last modified, tags, etc.), so an Excel option should be considered temporary at best while moving toward an automated method. From experience, manually populated Excel spreadsheets are difficult to process because manual data entry is rarely consistent or reliable, so this will likely require multiple additional adaptations.
>
> I agree with Ian that this functionality could be good to add, but I'm having a hard time understanding how much effort should be put into accommodating a manual workflow in automated tools.

As with almost everything "government", it is a lack of training. Left to themselves to "just get my job done", people will almost always resort to Excel. Why? For absolutely no good reason whatsoever. They do it because they can kinda get around in Excel, and because the person doing that job before them did it in Excel and that is how they were trained. People don't enjoy learning "yet another piece of software that will get replaced in a year or two anyway."

The actual conversion of Excel/CSV to JSON is trivial. See https://pandoc.org/.
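
A minimal illustration of how small that mechanical step is, using only the Python standard library (the file name "inventory.csv" is an assumption); the real work is agreeing on the column-to-field mapping, as sketched earlier in the thread:

```python
# Minimal sketch: dump CSV rows as JSON with the standard library only.
# "inventory.csv" is a placeholder name; no code.json field mapping is
# applied here, which is the actual hard part.
import csv
import json

with open("inventory.csv", newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))

print(json.dumps(rows, indent=2))
```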

In other words, Excel is what people always use because it is the panacea data-collection GUI. Why learn yet another software program when you already know how to enter data into Excel? That is your pushback right there.

Short-term solution: make a better Excel spreadsheet, based on a template, that is used solely for data entry. You can use various forms of lookups to get a handle on normalizing the input.

Long term, utilize existing open-source programs to formalize "THE" data entry system for ALL government documents. Train people on this single piece of software. Make it a web app, put enough computing power behind it so there is no latency, and have forms created automatically from a JSON schema. Lowest-common-denominator stuff. And provide developers with a way to customize the system while still following UX guidelines.

Then hub-and-spoke this or pub-and-sub it to every database the government has.

Technology is not the barrier to these problems. Budget is not a barrier to these projects. The barriers are training and a desperate need for standardized UX. People don't want to have to learn a dozen systems to do their job.

jcastle-zz commented 3 years ago

I appreciate everyone's thoughts and discussion on this topic. I vote to close this issue, as I don't think there is a feasible, "one size fits all" solution to this problem. I understand why individuals turn to Excel for creating standard data submissions, and there should be some consideration of the long-term management of these types of files, possibly to the point of learning how to use more modern tools and how to create JSON files. I don't think we will solve this here.

bettencb commented 3 years ago

You can do this easily with Python. I have the code for it if you would like it.
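
For reference, a minimal sketch of such a conversion with pandas (pandas reads .xlsx via openpyxl). The file and sheet names are assumptions based on the template posted above, and the columns would still need to be mapped onto code.json fields:

```python
# Hypothetical sketch: read the Excel template with pandas and emit JSON.
# The file name and sheet index are assumptions; install pandas and
# openpyxl first (pip install pandas openpyxl).
import json

import pandas as pd

df = pd.read_excel("Code.gov_Metadata_Schema_2.0.0_Template.xlsx", sheet_name=0)
records = df.fillna("").to_dict(orient="records")

# Raw column-named records; mapping these onto code.json fields is still needed.
print(json.dumps(records, indent=2, default=str))
```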

IanLee1521 commented 3 years ago

@bettencb -- The issue is less about having code to pull an Excel spreadsheet into a JSON file, and more about settling on a particular format for an Excel spreadsheet that would be used by all agencies.

There was some discussion about GSA developing a standard Excel spreadsheet that we could then import from, but based on @jcastle's response above, it seems like that is no longer in the works.

Therefore I concur with closing this issue. 👍