Closed dlebauer closed 9 years ago
Also:
@dlebauer A few questions:
Is it okay to show rows with missing sitename on the first page if city, lat, and lon exist? If so, changing the default sort column from "scientificname" to "sitename" will solve the problem we have.
Yes, that is okay.
Right now, on the advanced search page, users won't see any data with a more restrictive access_level than that associated with their login, even if it is data that they themselves inserted. Is that ok?
That is okay.
Check that the date column I added is OK (see the pecandev site). Note that the split-out columns (date_year, date_month, date_day) aren't in the view so are not viewable. Rob and I were in favor of getting rid of these, but I don't remember what we decided.
If this is the solution, please present the date in YYYY-MM-DD format, omitting time component. Most of the time, Date is used but time is not, so providing the time is incorrect. This is why we used the alternate formulation of dividing date-time into individual fields. As long as they can see outside of the advanced search page (e.g. when querying traits or yields directly).
I don't know what the best resolution for this is, but I do not believe that the yields table will ever need to indicate time. For the Traits table, the use of time is rare. How hard would it be to separate the date and time fields?
Please document the use of YYYY-01-01 and 18:00 -06:00 to indicate unknown values in the comments associated with the date-time field on these tables.
On Tue, Oct 6, 2015 at 4:41 PM, Scott Rohde notifications@github.com wrote:
@dlebauer https://github.com/dlebauer A few questions:
- Is it okay to show rows with missing sitename on the first page if city, lat, and lon exist? If so, changing the default sort column from "scientificname" to "sitename" will solve the problem we have. (I've done this for now (see the pecandev site). A more comprehensive solution would be to deal with the dangling site references in the traits and yields tables, which is largely responsible for the missing site data you and Steve were seeing. Or we could simple filter these rows out.)
- Right now, on the advanced search page, users won't see any data with a more restrictive access_level than that associated with their login, even if it is data that they themselves inserted. Is that ok?
- Check that the date column I added is OK (see the pecandev site). Note that the split-out columns (date_year, date_month, date_day) aren't in the view so are not viewable. Rob and I were in favor of getting rid of these, but I don't remember what we decided.
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-146009858.
@dlebauer OK, I've changed the date column to always display the date only in YYYY-MM-DD format. I do this by converting the date-time to UTC and then truncating the string to show only the date portion. (If we had more consistently used dateloc, I would have recommended displaying as much information as warranted by the dateloc value along with an indication of how precise the value is.) This is live on pecandev now.
Regarding the header of the CSV download files, it currently looks something like this:
# LeBauer, David, Michael Dietze, Rob Kooper, Steven Long, Patrick Mulrooney, Gareth Scott Rohde, Dan Wang (2010). Biofuel Ecophysiological Traits and Yields Database (BETYdb), Energy Biosciences Institute, University of Illinois at Urbana-Champaign. doi:10.13012/J8H41PB9 All public data in BETYdb is made available under the Open Data Commons Attribution License (ODC-By) v1.0. You are free to share, create, and adapt its contents. Data with an access_level field and value <= 2 is is not covered by this license but may be available for use with consent.
#
# Contact: dlebauer@illinois.edu
#
# SQL query: SELECT "traits_and_yields_view_private".* FROM "traits_and_yields_view_private" WHERE (traits_and_yields_view_private.access_level >= 1) AND ((LOWER(sitename) LIKE LOWER('%cottongrass%') OR LOWER(city) LIKE LOWER('%cottongrass%') OR LOWER(scientificname) LIKE LOWER('%cottongrass%') OR LOWER(commonname) LIKE LOWER('%cottongrass%') OR LOWER(author) LIKE LOWER('%cottongrass%') OR LOWER(trait) LIKE LOWER('%cottongrass%') OR LOWER(trait_description) LIKE LOWER('%cottongrass%') OR citation_year::text = 'cottongrass') AND (LOWER(sitename) LIKE LOWER('%Toolik%') OR LOWER(city) LIKE LOWER('%Toolik%') OR LOWER(scientificname) LIKE LOWER('%Toolik%') OR LOWER(commonname) LIKE LOWER('%Toolik%') OR LOWER(author) LIKE LOWER('%Toolik%') OR LOWER(trait) LIKE LOWER('%Toolik%') OR LOWER(trait_description) LIKE LOWER('%Toolik%') OR citation_year::text = 'Toolik') AND (LOWER(sitename) LIKE LOWER('%arctic%') OR LOWER(city) LIKE LOWER('%arctic%') OR LOWER(scientificname) LIKE LOWER('%arctic%') OR LOWER(commonname) LIKE LOWER('%arctic%') OR LOWER(author) LIKE LOWER('%arctic%') OR LOWER(trait) LIKE LOWER('%arctic%') OR LOWER(trait_description) LIKE LOWER('%arctic%') OR citation_year::text = 'arctic')) AND (checked >= 0)
#
# Date of query: 10/7/2015 10:56:42 AM
#
Are you saying you want the e-mail address "dlebauer@illinois.edu" changed to "betydb@gmail.com"?
As for a note in the header "explaining these fields", did you mean just the new "checked" field, or do you want an explanation for all fields?
It occurs to me that the time-of-query information should include the time zone and perhaps should always be in UTC.
Also note that the SQL query displayed does not include the "ORDER BY" clause of the actual query used.
Are you saying you want the e-mail address "dlebauer@illinois.edu" changed to "betydb@gmail.com"?
this is okay for now
As for a note in the header "explaining these fields", did you mean just the new "checked" field, or do you want an explanation for all fields?
The key is to provide information about the "Checked" field.
How hard would it be to query and write information (meta-data) contained in the comments of each field in the database? That would make it easier for me to update the meta-data that is returned.
It occurs to me that the time-of-query information should include the time zone and perhaps should always be in UTC.
UTC is fine. Whatever is easiest.
Also note that the SQL query displayed does not include the "ORDER BY" clause of the actual query used.
That is okay.
You can do raw SQL queries from Rails, so we just have to write a query to get the column comment(s) we want and then output that information where it is wanted. (The SQL query to retrieve the comment on traits.n
, for example, is
SELECT pgd.description
FROM pg_catalog.pg_statio_all_tables as st
INNER JOIN pg_catalog.pg_description pgd
ON pgd.objoid = st.relid
INNER JOIN information_schema.columns c
ON (pgd.objsubid = c.ordinal_position
AND c.table_schema = st.schemaname
AND c.table_name = st.relname)
WHERE c.table_schema = 'public'
AND c.table_name = 'traits' -- change this as needed
AND c.column_name = 'n'; -- change this as needed
)
We have been putting comments in the migrations, but this may be a bit unwieldy if you are wanting to update them frequently. I think it might be OK to add, delete, and modify comments outside of migrations provided that we make a point of regenerating production_structure.sql
whenever we do so, or at least on a regular basis. It might be worth checking how Rob's sync script handles them if you want to ensure they get broadcast to all database copies.
By the way, I think (though I'm not 100% sure) that view columns can't have comments, though views themselves can. So we might have to extract comments from the underlying traits and yields tables.
Time-of-query display for CSV downloads is now UTC.
@dlebauer Assigning to you for now. I've done all I can with current information except for merging changes into master and deploying.
Lets try to keep this simple. Perhaps we can provide a link to "metadata" with xml and html (in browser) options:
https://gist.github.com/dlebauer/1eb237bbf3abcd2265d6
Or its HTML variant https://uofi.box.com/eml-example
I think EML is a good idea, though it has a bit of a learning curve to use it right, especially if you want anything close to a comprehensive description. (But maybe using it at even the most basic level is better than not using it at all.)
And I am not sure I quite understand how you envision using it in this context--would you use a single metadata description for all potential download sets? (If so, this would necessarily leave out the SQL query used to extract the set. XML does, though, have the concept of inclusion whereby a file representing an XML document can include an XML fragment from another file, so perhaps there is some good solution using this XML feature. Alternatively, you could have a fixed template for the EML file where the data is included inline and the portions of the metadata that vary are generated dynamically. This is more "expensive" in terms of how much needs to be downloaded each time, but I like the idea of keeping the meta-data with the dataset since these datasets will be floating around "out in space" rather than having some fixed well-defined location.
Your sample EML file didn't validate (it needed a "contact" element, and some of the id attribute values were used multiple times within the same document; plus there was a typo in a name). I've attached a patched-up version that _does_validate. (So it's at least syntactically if not semantically correct.)
Also, I was wondering how you created the EML file. Did you use Morpho?
yes, I used Morpho. The morpho files are here: https://uofi.box.com/eml
On Fri, Oct 9, 2015 at 12:12 PM, Scott Rohde notifications@github.com wrote:
Also, I was wondering how you created the EML file. Did you use Morpho?
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-146935222.
This is done (and released) except for any enhancements/changes we may wish to make to the metadata and/or format of download files and except for the task pertaining to @ropensci/traits package functions.
@gsrohde could you please update the API documentation (https://www.authorea.com/users/5574/articles/7062/_show_article) with instructions on how to access the unchecked data?
@dlebauer I've added a section (4.3.1) to the API documentation that tells how to include unchecked data. Possibly some earlier sections in this document still have to be revised to be consistent with this. While I was at it, I revised much of this document to clarify how API searches work, particularly the section on cross-table searches and the (new) section devoted specifically to searches of traits_and_yields_view. (I might have made a more thorough-going revision of the whole document if I didn't find Authorea so hard to use.)
note that Authorea can be edited on your desktop. That said, I've been considering moving all of the documentation to a gitbook format, which is a very simple structure that I am using e.g. for github.com/terraref/documentation and has a nice interface at gitbook.com. It supports mathjax, which was the original motivation for leaving the github wiki. We could combine all of the documentation bits into a single book with each bit being a chapter
On Tue, Oct 20, 2015 at 5:41 PM, Scott Rohde notifications@github.com wrote:
@dlebauer https://github.com/dlebauer I've added a section (4.3.1) to the API documentation that tells how to include unchecked data. Possibly some earlier sections in this document still have to be revised to be consistent with this. While I was at it, I revised much of this document to clarify how API searches work, particularly the section on cross-table searches and the (new) section devoted specifically to searches of traits_and_yields_view. (I might have made a more thorough-going revision of the whole document if I didn't find Authorea so hard to use.)
— Reply to this email directly or view it on GitHub https://github.com/PecanProject/bety/issues/350#issuecomment-149723151.
Adding ability to query 'unchecked' data from the rOpenSci traits package has been moved to ropensci/traits#52
Per user request:
Add note to header explaining these fields, with betydb@gmail.com email addressnote in principle, we could also export records "known to be in error" as long as they are clearly labeled and excluded from e.g. r traits package by default