LEDApplications / DEPRECATED-lehd-schema

The draft version of the lehd schema: https://lehd.ces.census.gov/data/schema/
https://ledapplications.github.io/lehd-schema/
Creative Commons Zero v1.0 Universal

add schema table containing institution list #28

Closed srt1 closed 3 years ago

srt1 commented 3 years ago

Add the institution list as a metadata table accompanying the PSEO deliverable. Define the file in the schema.

srt1 commented 3 years ago

@andrewfoote I don't see a Redmine ticket (at least, not with category=PSEO).

It is probably appropriate to have both a Redmine and Github ticket. The first is to track code changes needed to generate any new table that belongs in the PSEO deliverable, and the Github one to document that in the schema. Unfortunately the tickets can't be directly linked, since they use separate systems.

_(This comment is in response to an earlier comment by @andrewfoote in https://github.com/andrewfoote/pseo_issues/issues/31)_

andrewfoote commented 3 years ago

I just put it into PSEO (realizing I didn't do it before)

srt1 commented 3 years ago

We may be electing not to include this as part of the formal PSEO deliverable (csv files). I think it will be only distributed on a standalone web page, for the time being. So at this point, we can omit it from the schema.

heathhayward commented 3 years ago

When Andrew and I discussed this last, we talked about having this institution list be a CSV that's delivered with the raw data files (in https://lehd.ces.census.gov/data/pseo/[VINTAGE]/ or https://lehd.ces.census.gov/data/pseo/[VINTAGE]/all/). So I agree it doesn't need to be part of the LEHD schema, but it would be great if it was part of the raw data delivery.

srt1 commented 3 years ago

If it's part of the data delivery (anything under https://lehd.ces.census.gov/data/pseo/[release]), it needs to be documented in the schema.

If it is a standalone web page only, we can sidestep that.



heathhayward commented 3 years ago

OK, I'm fine with either option: 1) make an HTML table listing the current institutions, or 2) make it part of the data delivery and include it in the schema. Number two is nice because the creation of that table would be part of an automated production process, and it seems like number one would be ad hoc? I'm ok having two institution lists in the schema, although it would be confusing. I'd also be ok with replacing the "all institutions" file with a "participating institutions" file in the schema, since that's what users care about.

srt1 commented 3 years ago

Yes, I was a bit confused about how the ad-hoc step would be handled. Let's be a bit formal about it, then.

This isn't a separate institution list for the schema. The file itself would live in the following location:

https://lehd.ces.census.gov/data/pseo/R2020Q3/all/

It is part of the ALL distribution, only, and the web page can reference it from there.

For the schema, we just need to document the structure of the csv itself. So the schema is documenting what is in the file, not providing the contents. The list of institutions comes only from label_institution.csv. It just happens that this file repeats the names, as the Excel outputs do (as Andrew recently reminded me).



srt1 commented 3 years ago

(BTW, I realize that the Excel files are not themselves part of the schema. Lars noted this earlier. I defend that on the grounds that it is simply taking the data in the csv, and attaching labels already in the schema. This file Andrew constructed contains data that does not live anywhere else, so we should document it separately)



srt1 commented 3 years ago

After seeing the outputs that Andrew had in mind, I do believe that we should include them as part of the release and document them in the data schema. As mentioned, these files should be included as part of the pseo/RyyyyQq/all directory of the release. We can document it in section 8.3. Section 8 describes the version files, as well as additional product-specific metadata we provide. I think this most closely resembles that.

coverage_bystate.csv

Recommended name: pseo_all_partners.csv _[and a bit ??? on the csv part]_

Structurally, this is a really awkward file. The "System Signatories" can have one or more entities in it. That is just ugly. The thing is, there is no reason why we have to constrain ourselves to a simple csv - or even to make it a csv in the first place. Files can have alternate structures. For example:

[state]
[share of IPEDS graduates]
[number of signatories]
[signatory list]

CO
70%
Colorado Department of Higher Education

TX
75%
University of Texas System
Texas Higher Education Coordination Board

...
...

We can document this (or some other) structure in the schema. It's still machine readable, if a user wants it that badly. We can refine it to make it more aesthetic for the website.
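To illustrate the "still machine readable" point, here is a minimal Python sketch that reads the blank-line-separated layout shown in the examples above. The function name `parse_records` and the dictionary keys are hypothetical. The sketch follows the CO and TX examples as written (state, share, then one signatory per line) and does not expect a separate "number of signatories" line, since the examples do not include one.

```python
import re


def parse_records(text):
    """Split blank-line-separated blocks into one record per state.

    Assumed layout per block (taken from the examples in this comment):
      line 1: state abbreviation
      line 2: share of IPEDS graduates
      line 3+: one signatory per line
    """
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            # Skip incomplete blocks rather than guessing at their meaning.
            continue
        records.append(
            {"state": lines[0], "share": lines[1], "signatories": lines[2:]}
        )
    return records


sample = """CO
70%
Colorado Department of Higher Education

TX
75%
University of Texas System
Texas Higher Education Coordination Board
"""

records = parse_records(sample)
for rec in records:
    print(rec["state"], rec["share"], len(rec["signatories"]))
```

The number of signatories falls out of the list length, which is one argument for not storing it as its own line.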

Side note: do we really need the coverage in this table? It seems a more natural fit for the "map" table, following, but we can do both ...

coverage_allstates_map.csv

Recommended name: pseo_all_coverage.csv

I would modify the state column to follow whatever standard we have for that. As just mentioned, can we add the share to this table? I wouldn't see the harm in having it in both. Even if the map itself doesn't use it, it's a nice piece of data for users to download.

We will also want to add a new label_share_grads_map.csv file describing the possible values of the share_grads_map variable (0, 1, 2, 3, 4, 5).

institution_list.csv

Recommended name: pseo_all_institutions.csv

Can we continue to use the variable name INSTITUTION, rather than OPEID here? We have documented the INSTITUTION variable, and the OPEIDs are already included. It would be nice to be internally consistent.

Interested in feedback on this, thanks.

heathhayward commented 3 years ago

I think only two of these files need to be publicly available - the map file's only purpose is to populate the map that Chaoling is building. The institution_list file (which I agree should have the same column names/structure as https://lehd.ces.census.gov/data/schema/V4.7.0/label_institution.csv, especially if it helps reduce the schema documentation burden) and the coverage_bystate file should have all the information users need. We don't, for example, make the input file for the LED partner map publicly available.

It would be great if an input file for the map is created via PSEO production processes, but if that breaks a rule then the web team can manually control/edit the map input file.

And in terms of the coverage_bystate file, I agree it doesn't have to be a CSV - an Excel spreadsheet would give us better formatting options, especially if we want to display it as is under the state map.

Lastly, maybe I should invite Chaoling to the 11am meeting? I've asked her for specs for the files she needs to replicate the page mockup I've provided, but I haven't heard anything from her yet.

srt1 commented 3 years ago

I have created a new set of metadata files to include along with the PSEO release package. Some highlights:

Andrew and I will put together some schema content for section 8.4 to describe these files.

srt1 commented 3 years ago

Proposed content for 8.4

8.4 Additional metadata for PSEO files

Several additional files within each state release are included to provide information on the institutions within the scope of PSEO. The ALL directory consolidates the individual state files.

8.4.1 PSEO Data Partners and Coverage (pseo_[ST]_partners.txt)

This file contains information on PSEO coverage of graduates, as well as the partner organization(s) providing data. This is presented on several lines of a text file, as follows:

The share is derived from Integrated Postsecondary Education Data System (IPEDS) data, using program graduates from 2015 for degree levels within the scope of PSEO. It calculates the number of graduates from institutions that are available to PSEO as a fraction of graduates from all institutions within IPEDS for the reference state.
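As a rough illustration of the share calculation described above, here is a minimal Python sketch. The institution identifiers and graduate counts are made up for the example, and `coverage_share` is a hypothetical helper, not part of any PSEO production code.

```python
def coverage_share(grads_by_institution, available):
    """Fraction of a state's graduates at institutions available to PSEO.

    grads_by_institution: graduate counts for every institution in the
        reference state (hypothetical stand-in for the IPEDS counts).
    available: set of institution identifiers available to PSEO.
    """
    total = sum(grads_by_institution.values())
    if total == 0:
        raise ValueError("no graduates reported for the reference state")
    covered = sum(
        count
        for institution, count in grads_by_institution.items()
        if institution in available
    )
    return covered / total


# Hypothetical counts, for illustration only.
grads = {"inst_a": 600, "inst_b": 300, "inst_c": 100}
share = coverage_share(grads, available={"inst_a", "inst_c"})
print(f"{share:.0%} of statewide graduates covered")
```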

A sample file follows:

08 Colorado
72% of statewide graduates covered (2015 estimate)
Colorado Department of Higher Education
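For users who want to read this file programmatically, a minimal Python sketch follows. The line-by-line layout is inferred from the sample only (line 1: state FIPS code and name, line 2: coverage statement, remaining lines: one partner each), so treat it as an assumption. The function name `parse_partners` and the output keys are hypothetical.

```python
import re


def parse_partners(text):
    """Parse the text of a pseo_[ST]_partners.txt file (layout assumed from the sample)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a state line and a coverage line")
    # Line 1: FIPS code, then the state name (which may contain spaces).
    fips, state_name = lines[0].split(maxsplit=1)
    # Line 2: coverage statement beginning with a whole-number percentage.
    match = re.match(r"(\d+)%", lines[1])
    return {
        "fips": fips,
        "state": state_name,
        "share_pct": int(match.group(1)) if match else None,
        "coverage_text": lines[1],
        "partners": lines[2:],
    }


sample = """08 Colorado
72% of statewide graduates covered (2015 estimate)
Colorado Department of Higher Education
"""

info = parse_partners(sample)
print(info["fips"], info["state"], info["share_pct"], info["partners"])
```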

[note - use IPEDS link in section 7.3]

8.4.2 Institutions available within PSEO (pseo_[ST]_institutions.csv)

(variables_pseo_institutions.csv)

This file provides the list of institutions that are included in the PSEO release. This file is an extract from label_institution.csv.

The files are structured as follows:

_[Include the variables_pseo_institutions.csv file, as per section 8.3]_

_[Both csv files should be links. I will upload variables_pseo_institutions.csv.]_

heathhayward commented 3 years ago

A couple comments:

srt1 commented 3 years ago
heathhayward commented 3 years ago

OK. Re: bullet 1, I'm ok leaving it as is. Once Jody implements these text changes, this issue can be closed. Jody let me know if you need something more from me.

jodyhoonstarr commented 3 years ago

Review the docs in a few minutes here

I put in a link and a dynamic table using variables_pseo_institutions.csv, but it wasn't obvious to me which link was supposed to go in the section for pseo_[ST]_partners.txt above. The example table

08 Colorado
72% of statewide graduates covered (2015 estimate)
Colorado Department of Higher Education

is hardcoded. Let me know if it should point to something dynamic so that it's regenerated when the data updates.

heathhayward commented 3 years ago

OK, I'll take a look. That table can be found in /data/rawdata/pseo/latest_release/co/pseo_co_partners.txt (on DEV only for now)

jodyhoonstarr commented 3 years ago

Are those files going to be dropped into the schema itself or are we considering it data?

heathhayward commented 3 years ago

we are considering it data. Does that mean it has to be hardcoded?

jodyhoonstarr commented 3 years ago

Yeah, for it to be regenerated on the fly it has to be stuck into the 4.8 folder. We can use the hardcoded example and point to the data location using latest_release.

jodyhoonstarr commented 3 years ago

e.g. https://lehd.ces.census.gov/data/pseo/latest_release/co/pseo_co_partners.txt (which won't work until the data is pushed obviously)
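A minimal Python sketch of how that location could be built for any state, assuming the other states follow the same pattern as the CO example. `partners_url` is a hypothetical helper, and it only builds the string; it does not check that the file exists.

```python
BASE_URL = "https://lehd.ces.census.gov/data/pseo"


def partners_url(state, release="latest_release"):
    """Build the URL of a state's partner file, following the CO example.

    state: two-letter state abbreviation, in any case.
    release: release directory name; "latest_release" is the alias used above.
    """
    st = state.lower()
    return f"{BASE_URL}/{release}/{st}/pseo_{st}_partners.txt"


print(partners_url("CO"))
```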

heathhayward commented 3 years ago

OK, great. The new 8.4 section looks good to me. Should I run it by Stephen now?

srt1 commented 3 years ago

Section 8.4 looks pretty good to me. The only thing I noticed that was missed is that label_institution.csv in 8.4.2 should be a hyperlink to that file in the current schema (same link as 6.13.2).

Also, please update section 9. Changes with a second bullet:

Thanks.

srt1 commented 3 years ago

Trying to catch up with Jody's earlier comment, I'm OK with hardcoding the CO example. If we change the methodology (say, change the reference year), we might want to edit it, but I don't know whether it's worth it to make it dynamic at this time. Note, since the schema does say 2015, we'll have to update the schema anyway, if we change that.

Note, pseo_[ST]_partners.txt is not intended to be a link in the schema. The one other place we reference a file like that is label_geography_[ST].csv. It's a little kludgy, but it's not a link.

I wouldn't consider it a requirement to link directly to each state partner file - the whole idea with making it data is that the partner list (and institution list) itself is not part of the schema. The schema just needs to describe what is in those files. But up to you guys if you want to link to something.

srt1 commented 3 years ago

I think it looks good, after Jody's last edit. One last tweak - see inconsistent capitalization of the word "files". I guess the convention is to use uppercase? (section 8.4)

  8. Metadata

    8.1. Version Metadata for QWI, J2J, and PSEO Files (version.txt)
    8.2. Additional Metadata for J2JOD Files (avail.csv)
    8.3. Metadata on Indicator Availability
    8.4. Additional metadata for PSEO files
      8.4.1. PSEO Data Partners and Coverage (pseo_[ST]_partners.txt)
      8.4.2. Institutions available within PSEO (pseo_[ST]_institutions.csv)

jodyhoonstarr commented 3 years ago

Taking a scroll through the rest of the schema the titlecasing is kinda all over the place. Spinning off another ticket to make this more consistent.

srt1 commented 3 years ago

No requirement from me to go through the rigamarole. At your leisure. I just noted that in the TOC.

srt1 commented 3 years ago

BTW, I am signing off on section 8.4. Feel free to close - wasn't sure if you had anything else to do on this.