edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Document the differences in the data format of the different sources (PageFreezer, Versionista). #46

Closed titaniumbones closed 1 year ago

titaniumbones commented 7 years ago

We are building a flexible framework designed to accommodate a variety of crawled page snapshots. Different services produce different data formats. By documenting them carefully, we set ourselves up for success.

Mr0grog commented 7 years ago

@titaniumbones are you talking differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions

There’s no PageFreezer info there yet because we do not have a consistent format to document for it yet.

Mr0grog commented 7 years ago

> There’s no PageFreezer info there yet because we do not have a consistent format to document for it yet.

Same for IA, too.

titaniumbones commented 7 years ago

@Mr0grog I think it was the data in the actual systems themselves. But... good question! Um. This came out of the GSoC meeting, where we were trying to turn the candidates' proposals into issues which could be grouped into milestones. So I think @janakrajchadha @suchthis @mhucka @danielballan will probably refine the issue together!

danielballan commented 7 years ago

@janakrajchadha Your proposal included "Understand the differences...." Once the differences are clear in your mind, documentation can be a concrete achievement for this task.

danielballan commented 7 years ago

Oops, @Mr0grog's comment and the subsequent ones hadn't loaded for me when I posted the above. Yes, the task here is to both determine and document source_metadata for PF and IA.

Mr0grog commented 7 years ago

Ah, sorry! Didn’t realize this was coming out of another discussion. 👍

janakrajchadha commented 7 years ago

@danielballan @titaniumbones Can either of you point me to where something similar has been done for Versionista (if it exists)?

Mr0grog commented 7 years ago

@janakrajchadha in terms of the raw data we can get out of Versionista, that’s never been documented:

You could check out a recent output file to get a feel, though: https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/versionista1/metadata-2017-06-20T00%3A00Z.json

janakrajchadha commented 7 years ago

> @titaniumbones are you talking differences in the actual systems themselves or the data we are storing and making public? If the latter, that is documented here (in the source_metadata item): https://github.com/edgi-govdata-archiving/web-monitoring#versions

@Mr0grog I may be wrong here, but what we're storing as versions contains the different fields which we want in our DB and the source_metadata is what we get from the source itself. After taking a look at the recent Versionista output file, I would say that the source_metadata for a general case of Versionista output is in fact documented well. How is the data which we are storing and making public (in the source_metadata field) different from the data in the actual system output? Am I confusing terms here?

Mr0grog commented 7 years ago

> source_metadata is what we get from the source itself

It’s close, but not exactly the same. source_metadata doesn’t include fields that are already represented in the page and version records, and it also flattens some nested fields (e.g. diff.hash becomes diff_hash). See here for the script that converts raw Versionista scraper output to DB input: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/bin/import-to-db#L55-L77
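A minimal sketch of the kind of conversion described above, written in Python purely for illustration (the linked import script is the authoritative version). Apart from the flattening of a nested diff hash into diff_hash, everything here is hypothetical: the helper `to_source_metadata`, the set `ALREADY_IN_RECORDS`, and the sample field names and values.

```python
# Fields assumed (hypothetically) to already live on the page/version records.
ALREADY_IN_RECORDS = {"url", "title", "date", "hash"}


def to_source_metadata(raw):
    """Build a flat source_metadata dict from one raw scraper record."""
    metadata = {}
    for key, value in raw.items():
        if key in ALREADY_IN_RECORDS:
            continue  # represented elsewhere, so not repeated in source_metadata
        if isinstance(value, dict):
            # Flatten one level of nesting: {"diff": {"hash": x}} -> {"diff_hash": x}
            for inner_key, inner_value in value.items():
                metadata[f"{key}_{inner_key}"] = inner_value
        else:
            metadata[key] = value
    return metadata


# Made-up sample record; real scraper output has more fields.
raw_version = {
    "url": "https://example.gov/page",
    "date": "2017-06-20T00:00:00Z",
    "hash": "abc123",
    "versionId": "v1",
    "diff": {"hash": "def456", "length": 42},
}
source_metadata = to_source_metadata(raw_version)
```

The point of the sketch is only that source_metadata is a filtered, flattened view of the raw output, which is why the two do not match field for field.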

janakrajchadha commented 7 years ago

@Mr0grog Oh, that was a little confusing earlier because of the term source_metadata being used for different things. Thanks for the clarification!

I was just adding the information in the documentation for some of the fields in the PageFreezer output and I was wondering if summarizing the internal info is a better way to go. Since we were talking about the data in the actual systems themselves, there are some fields which do not concern us. I just wanted to get a sense of how detailed we want this documentation to be. @danielballan @titaniumbones @suchthis @mhucka

janakrajchadha commented 7 years ago

@titaniumbones @mhucka @suchthis @danielballan @Mr0grog I've documented the data format of the different sources and added a table of the differences between them. A few fields don't have a description as I wasn't sure what they meant. There's also an IPython notebook which can be used to view an example of output for IA and PF, and I've added a link to a Versionista output in the document itself. See https://github.com/edgi-govdata-archiving/web-monitoring-processing/pull/60/commits/d92164f553485b099f9b73d75cf3d0a2f067aced Please review.

janakrajchadha commented 7 years ago

@danielballan Should this be closed or should we keep this open as I still have to add a little more information to the document?

danielballan commented 7 years ago

Let's leave it open to track our progress. Would you enumerate the blank entries here? Then we can ask for external help.

janakrajchadha commented 7 years ago

PageFreezer

Versionista

Mr0grog commented 7 years ago

> A few fields don't have a description as I wasn't sure what they meant…

Versionista

  • diffWithPreviousDate
  • diffWithFirstDate

These two fields are kind of weird and are sort of a result of the CSVs that analysts are currently using.

diffWithFirstDate is listed in the CSVs as the “date” of the diff between the current version and the first-ever-captured version. Diffs don’t really have a date, though, so this is actually just the capture date of the first-ever captured version of this page.

diffWithPreviousDate is the date of the current version (not the date of the previous version being diffed with, as you might expect from the name).
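To make the naming less surprising, here is a small Python sketch that re-labels the two fields according to the explanation above. Only diffWithFirstDate and diffWithPreviousDate come from the Versionista output; the helper `relabel_diff_dates`, the output keys, and the sample timestamps are made up for illustration.

```python
def relabel_diff_dates(version):
    """Return the two confusingly named date fields under descriptive keys."""
    return {
        # Capture date of the first-ever captured version of the page.
        "first_capture_date": version.get("diffWithFirstDate"),
        # Capture date of the *current* version, despite the field's name.
        "current_capture_date": version.get("diffWithPreviousDate"),
    }


# Made-up sample values.
sample = {
    "diffWithFirstDate": "2017-01-15T12:00:00Z",
    "diffWithPreviousDate": "2017-06-20T00:00:00Z",
}
labels = relabel_diff_dates(sample)
```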

Mr0grog commented 7 years ago

Three other minor notes:

janakrajchadha commented 7 years ago

Thanks a lot @Mr0grog! The date fields are ambiguous.

> hash is missing (I think it got accidentally combined with filePath above)

Yeah, I probably mixed this up as the hash and filePath are kept as a single object in the output.

Mr0grog commented 7 years ago

Hmmm, hash and filePath should not be a single object. Are there metadata files where they are? If so, we should correct those.

janakrajchadha commented 7 years ago

The hash and path of the diff are in a single object. The version hash and filePath aren't. I think I confused those two. Apologies.
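For anyone checking metadata files for this, a rough Python sketch of the shape being described: the version's own hash and filePath sit at the top level, while the diff's hash and path are grouped in one nested object. The nested key name `path`, the helper `misplaced_fields`, and all sample values are assumptions for illustration and have not been checked against real scraper output.

```python
def misplaced_fields(version):
    """List problems where hash/filePath are not top-level strings."""
    problems = []
    for key in ("hash", "filePath"):
        if not isinstance(version.get(key), str):
            problems.append(f"{key} should be a top-level string")
    diff = version.get("diff")
    if diff is not None and not isinstance(diff, dict):
        problems.append("diff should be a nested object")
    return problems


# Shape as described: version fields at the top level, diff fields nested.
expected_shape = {
    "hash": "abc123",
    "filePath": "pages/v1.html",
    "diff": {"hash": "def456", "path": "diffs/v1.html"},
}

# A hypothetical bad record where the version fields were combined into one object.
combined_by_mistake = {"file": {"hash": "abc123", "filePath": "pages/v1.html"}}
```

Running `misplaced_fields` over each record in a metadata file would flag any record where the version-level fields were combined.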

janakrajchadha commented 7 years ago

@Mr0grog I think Internet Archive and Versionista have been well documented here. There are a few fields missing in the PageFreezer part and I was hoping that we could bring up the topic of them providing us better API documentation in the coming discussions with them. cc: @ambergman

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

We have source_metadata_versionista documented in the DB docs; we should probably do the same for source_metadata_web_monitoring. Or we should document that info somewhere else. In any case, this still seems relevant.

Mr0grog commented 1 year ago

At this point, I’m just going to close this. The project is shutting down.