PecanProject / bety

Web-interface to the Biofuel Ecophysiological Traits and Yields Database (used by PEcAn and TERRA REF)
https://www.betydb.org
BSD 3-Clause "New" or "Revised" License
16 stars 38 forks source link

Export "unchecked" records via web interface and API #350

Closed dlebauer closed 8 years ago

dlebauer commented 8 years ago

Per user request:

  1. do not filter by QA/QC status (if user sets "include_unchecked" to true)
    • [x] add "checked" field to first column of exported (downloaded, queried via API) files, with "checked" and "unchecked" as values
    • [x] sort records so that 'unchecked' values appear at the end
  2. [x] Add note to header explaining these fields, with betydb@gmail.com email address
  3. [x] @ropensci/traits package functions should return only "checked" by default, but add flag to return other levels (@dlebauer) (moved to ropensci/traits#52)

note in principle, we could also export records "known to be in error" as long as they are clearly labeled and excluded from e.g. r traits package by default

gsrohde commented 8 years ago

Also:

  1. Avoid showing rows with missing site data near the top of the search results.
  2. Add a date column to the table.
gsrohde commented 8 years ago

@dlebauer A few questions:

  1. Is it okay to show rows with missing sitename on the first page if city, lat, and lon exist? If so, changing the default sort column from "scientificname" to "sitename" will solve the problem we have. (I've done this for now (see the pecandev site). A more comprehensive solution would be to deal with the dangling site references in the traits and yields tables, which is largely responsible for the missing site data you and Steve were seeing. Or we could simple filter these rows out.)
  2. Right now, on the advanced search page, users won't see any data with a more restrictive access_level than that associated with their login, even if it is data that they themselves inserted. Is that ok?
  3. Check that the date column I added is OK (see the pecandev site). Note that the split-out columns (date_year, date_month, date_day) aren't in the view so are not viewable. Rob and I were in favor of getting rid of these, but I don't remember what we decided.
dlebauer commented 8 years ago

Is it okay to show rows with missing sitename on the first page if city, lat, and lon exist? If so, changing the default sort column from "scientificname" to "sitename" will solve the problem we have.

Yes, that is okay.

Right now, on the advanced search page, users won't see any data with a more restrictive access_level than that associated with their login, even if it is data that they themselves inserted. Is that ok?

That is okay.

Check that the date column I added is OK (see the pecandev site). Note that the split-out columns (date_year, date_month, date_day) aren't in the view so are not viewable. Rob and I were in favor of getting rid of these, but I don't remember what we decided.

If this is the solution, please present the date in YYYY-MM-DD format, omitting time component. Most of the time, Date is used but time is not, so providing the time is incorrect. This is why we used the alternate formulation of dividing date-time into individual fields. As long as they can see outside of the advanced search page (e.g. when querying traits or yields directly).

I don't know what the best resolution for this is, but I do not believe that the yields table will ever need to indicate time. For the Traits table, the use of time is rare. How hard would it be to separate the date and time fields?

Please document the use of YYYY-01-01 and 18:00 -06:00 to indicate unknown values in the comments associated with the date-time field on these tables.

On Tue, Oct 6, 2015 at 4:41 PM, Scott Rohde notifications@github.com wrote:

@dlebauer https://github.com/dlebauer A few questions:

  1. Is it okay to show rows with missing sitename on the first page if city, lat, and lon exist? If so, changing the default sort column from "scientificname" to "sitename" will solve the problem we have. (I've done this for now (see the pecandev site). A more comprehensive solution would be to deal with the dangling site references in the traits and yields tables, which is largely responsible for the missing site data you and Steve were seeing. Or we could simple filter these rows out.)
  2. Right now, on the advanced search page, users won't see any data with a more restrictive access_level than that associated with their login, even if it is data that they themselves inserted. Is that ok?
  3. Check that the date column I added is OK (see the pecandev site). Note that the split-out columns (date_year, date_month, date_day) aren't in the view so are not viewable. Rob and I were in favor of getting rid of these, but I don't remember what we decided.

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-146009858.

gsrohde commented 8 years ago

@dlebauer OK, I've changed the date column to always display the date only in YYYY-MM-DD format. I do this by converting the date-time to UTC and then truncating the string to show only the date portion. (If we had more consistently used dateloc, I would have recommended displaying as much information as warranted by the dateloc value along with an indication of how precise the value is.) This is live on pecandev now.

Regarding the header of the CSV download files, it currently looks something like this:

# LeBauer, David, Michael Dietze, Rob Kooper, Steven Long, Patrick Mulrooney, Gareth Scott Rohde, Dan Wang (2010).  Biofuel Ecophysiological Traits and Yields Database (BETYdb), Energy Biosciences Institute, University of Illinois at Urbana-Champaign. doi:10.13012/J8H41PB9 All public data in BETYdb is made available under the Open Data Commons Attribution License (ODC-By) v1.0. You are free to share, create, and adapt its contents. Data with an access_level field and value <= 2 is is not covered by this license but may be available for use with consent.
#
# Contact: dlebauer@illinois.edu
#
# SQL query: SELECT "traits_and_yields_view_private".* FROM "traits_and_yields_view_private"  WHERE (traits_and_yields_view_private.access_level >= 1) AND ((LOWER(sitename) LIKE LOWER('%cottongrass%') OR LOWER(city) LIKE LOWER('%cottongrass%') OR LOWER(scientificname) LIKE LOWER('%cottongrass%') OR LOWER(commonname) LIKE LOWER('%cottongrass%') OR LOWER(author) LIKE LOWER('%cottongrass%') OR LOWER(trait) LIKE LOWER('%cottongrass%') OR LOWER(trait_description) LIKE LOWER('%cottongrass%') OR citation_year::text = 'cottongrass') AND (LOWER(sitename) LIKE LOWER('%Toolik%') OR LOWER(city) LIKE LOWER('%Toolik%') OR LOWER(scientificname) LIKE LOWER('%Toolik%') OR LOWER(commonname) LIKE LOWER('%Toolik%') OR LOWER(author) LIKE LOWER('%Toolik%') OR LOWER(trait) LIKE LOWER('%Toolik%') OR LOWER(trait_description) LIKE LOWER('%Toolik%') OR citation_year::text = 'Toolik') AND (LOWER(sitename) LIKE LOWER('%arctic%') OR LOWER(city) LIKE LOWER('%arctic%') OR LOWER(scientificname) LIKE LOWER('%arctic%') OR LOWER(commonname) LIKE LOWER('%arctic%') OR LOWER(author) LIKE LOWER('%arctic%') OR LOWER(trait) LIKE LOWER('%arctic%') OR LOWER(trait_description) LIKE LOWER('%arctic%') OR citation_year::text = 'arctic')) AND (checked >= 0)
#
# Date of query: 10/7/2015  10:56:42 AM
#

Are you saying you want the e-mail address "dlebauer@illinois.edu" changed to "betydb@gmail.com"?

As for a note in the header "explaining these fields", did you mean just the new "checked" field, or do you want an explanation for all fields?

It occurs to me that the time-of-query information should include the time zone and perhaps should always be in UTC.

Also note that the SQL query displayed does not include the "ORDER BY" clause of the actual query used.

dlebauer commented 8 years ago

Are you saying you want the e-mail address "dlebauer@illinois.edu" changed to "betydb@gmail.com"?

this is okay for now

As for a note in the header "explaining these fields", did you mean just the new "checked" field, or do you want an explanation for all fields?

The key is to provide information about the "Checked" field.

How hard would it be to query and write information (meta-data) contained in the comments of each field in the database? That would make it easier for me to update the meta-data that is returned.

It occurs to me that the time-of-query information should include the time zone and perhaps should always be in UTC.

UTC is fine. Whatever is easiest.

Also note that the SQL query displayed does not include the "ORDER BY" clause of the actual query used.

That is okay.

gsrohde commented 8 years ago

You can do raw SQL queries from Rails, so we just have to write a query to get the column comment(s) we want and then output that information where it is wanted. (The SQL query to retrieve the comment on traits.n, for example, is

SELECT pgd.description
       FROM pg_catalog.pg_statio_all_tables as st
            INNER JOIN pg_catalog.pg_description pgd
                  ON pgd.objoid = st.relid
            INNER JOIN information_schema.columns c
                  ON (pgd.objsubid = c.ordinal_position
                      AND c.table_schema = st.schemaname
                      AND c.table_name = st.relname)
WHERE c.table_schema = 'public'
      AND c.table_name = 'traits' -- change this as needed
      AND c.column_name = 'n'; -- change this as needed

)

We have been putting comments in the migrations, but this may be a bit unwieldy if you are wanting to update them frequently. I think it might be OK to add, delete, and modify comments outside of migrations provided that we make a point of regenerating production_structure.sql whenever we do so, or at least on a regular basis. It might be worth checking how Rob's sync script handles them if you want to ensure they get broadcast to all database copies.

gsrohde commented 8 years ago

By the way, I think (though I'm not 100% sure) that view columns can't have comments, though views themselves can. So we might have to extract comments from the underlying traits and yields tables.

gsrohde commented 8 years ago

Time-of-query display for CSV downloads is now UTC.

gsrohde commented 8 years ago

@dlebauer Assigning to you for now. I've done all I can with current information except for merging changes into master and deploying.

dlebauer commented 8 years ago

Lets try to keep this simple. Perhaps we can provide a link to "metadata" with xml and html (in browser) options:

https://gist.github.com/dlebauer/1eb237bbf3abcd2265d6

Or its HTML variant https://uofi.box.com/eml-example

gsrohde commented 8 years ago

I think EML is a good idea, though it has a bit of a learning curve to use it right, especially if you want anything close to a comprehensive description. (But maybe using it at even the most basic level is better than not using it at all.)

And I am not sure I quite understand how you envision using it in this context--would you use a single metadata description for all potential download sets? (If so, this would necessarily leave out the SQL query used to extract the set. XML does, though, have the concept of inclusion whereby a file representing an XML document can include an XML fragment from another file, so perhaps there is some good solution using this XML feature. Alternatively, you could have a fixed template for the EML file where the data is included inline and the portions of the metadata that vary are generated dynamically. This is more "expensive" in terms of how much needs to be downloaded each time, but I like the idea of keeping the meta-data with the dataset since these datasets will be floating around "out in space" rather than having some fixed well-defined location.

Your sample EML file didn't validate (it needed a "contact" element, and some of the id attribute values were used multiple times within the same document; plus there was a typo in a name). I've attached a patched-up version that _does_validate. (So it's at least syntactically if not semantically correct.)

bety_search_eml-fixed.xml.txt

gsrohde commented 8 years ago

Also, I was wondering how you created the EML file. Did you use Morpho?

dlebauer commented 8 years ago

yes, I used Morpho. The morpho files are here: https://uofi.box.com/eml

On Fri, Oct 9, 2015 at 12:12 PM, Scott Rohde notifications@github.com wrote:

Also, I was wondering how you created the EML file. Did you use Morpho?

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-146935222.

gsrohde commented 8 years ago

This is done (and released) except for any enhancements/changes we may wish to make to the metadata and/or format of download files and except for the task pertaining to @ropensci/traits package functions.

dlebauer commented 8 years ago

@gsrohde could you please update the API documentation (https://www.authorea.com/users/5574/articles/7062/_show_article) with instructions on how to access the unchecked data?

gsrohde commented 8 years ago

@dlebauer I've added a section (4.3.1) to the API documentation that tells how to include unchecked data. Possibly some earlier sections in this document still have to be revised to be consistent with this. While I was at it, I revised much of this document to clarify how API searches work, particularly the section on cross-table searches and the (new) section devoted specifically to searches of traits_and_yields_view. (I might have made a more thorough-going revision of the whole document if I didn't find Authorea so hard to use.)

dlebauer commented 8 years ago

note that Authorea can be edited on your desktop. That said, I've been considering moving all of the documentation to a gitbook format, which is a very simple structure that I am using e.g. for github.com/terraref/documentation and has a nice interface at gitbook.com. It supports mathjax, which was the original motivation for leaving the github wiki. We could combine all of the documentation bits into a single book with each bit being a chapter

On Tue, Oct 20, 2015 at 5:41 PM, Scott Rohde notifications@github.com wrote:

@dlebauer https://github.com/dlebauer I've added a section (4.3.1) to the API documentation that tells how to include unchecked data. Possibly some earlier sections in this document still have to be revised to be consistent with this. While I was at it, I revised much of this document to clarify how API searches work, particularly the section on cross-table searches and the (new) section devoted specifically to searches of traits_and_yields_view. (I might have made a more thorough-going revision of the whole document if I didn't find Authorea so hard to use.)

— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-149723151.

dlebauer commented 8 years ago

Adding ability to query 'unchecked' data from the rOpenSci traits package has been moved to ropensci/traits#52