IQSS / dataverse

Open source research data repository software
http://dataverse.org

Dynamic Custom Homepage - ROUND TWO #5445

Closed mheppler closed 5 years ago

mheppler commented 5 years ago

Misc HTML + CSS + layout improvements

Javascript fixes

Other customization fixes

Homepage template fixes

Additional curation efforts

Related GitHub Issues

Updated Activity section

[Screenshot: updated Activity section, 2019-01-10]

Misc notes...

scolapasta commented 5 years ago

Related to the "Activity download count being off" to-do list item: #4970

matthew-a-dunlap commented 5 years ago

For the "Activity" download counts problem, a short-term fix could be to just remove that section of the html until we get the metrics to line up in a future release

scolapasta commented 5 years ago

Should the Search input watermark dataset count be 27.4k (# of datasets added?) or 81.2k (total number including harvested)? I'd vote for the latter.

djbrooke commented 5 years ago

There's some more feedback coming from @mercecrosas for this issue. @TaniaSchlatter will add it tomorrow morning.

sbarbosadataverse commented 5 years ago

I sent feedback to Tania already


TaniaSchlatter commented 5 years ago

@scolapasta Search input watermark dataset count should be @ 81.2k – total number including harvested.

mheppler commented 5 years ago

Wanted to record this Stack Overflow resource for new column CSS properties used in the subject count and recent dataset sections.

djbrooke commented 5 years ago

To-Do List from most recent design review

Other customization fixes

mheppler commented 5 years ago

Added new noscript error alert msg block to header and bundle.

[Screenshot: noscript error alert message in the page header, 2019-01-16]

This was discussed with the development team, and we decided it best belongs as part of the site-wide dataverse_header.xhtml, which the template includes on every page. The "error" styling was also required because some important features, like file download or any feature behind a button dropdown menu, do not work or are inaccessible without JavaScript.

mheppler commented 5 years ago

Moved this issue, and its sibling issue Homepage Count Updates #5447, into Code Review along with the PR #5475.

As outlined above, there are still outstanding customization and curation to-do items that will be completed outside of this issue. Those will be coordinated with @kcondon as part of the procedure for moving the dynamic custom homepage back to production.

[Screenshot: root dataverse homepage, 2019-01-17]

TaniaSchlatter commented 5 years ago

Feedback from review:

[Screenshot: feedback from review, 2019-01-31]

dlmurphy commented 5 years ago

Noticed a typo in the upper right of the page:

"A dataverse is container for all..."

should be:

"A dataverse is a container for all..."

mheppler commented 5 years ago

Fixed the revisions requested above, except for the responsive behavior. Hoping to learn more about the expected behavior with production/dynamic data.

Updated the Harvard Dataverse Customization documentation in Google Drive. Reviewing those resources with @kcondon to configure them on the test server.

mheppler commented 5 years ago

Got an approval today on the layout revisions. Passing to QA.

kcondon commented 5 years ago

[ ] Quick check of numbers shows some significant differences:

Non-harvested: custom homepage 25,494, orig homepage facet 28,329, diff 2,835
Harvested: custom homepage 16,391, orig homepage facet 53,393, diff 37,002

This was tested after a clean reindex, not logged in.

For reference, the custom homepage reports published dataset counts, including harvested, from a db query.

The original homepage facets report all datasets viewable by the user currently logged in, at whichever dataverse that user is viewing, and count a published plus a draft version as 2. So, a good comparison would be a not-logged-in user at the root dataverse.

There are possibly some marginal differences due to caching, failure to index some datasets, and failure to expunge some deleted datasets from the index, but these should be relatively small and partially corrected by clearing the cache and doing a clean reindex. There are 9 datasets that failed to index, based on indextime being null in this test db.
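
A quick way to spot those is a query along these lines (rough sketch against the application database; it assumes indextime lives on dvobject and that dtype distinguishes datasets):

-- Count datasets whose indextime is still null, i.e. never indexed
-- even after the clean reindex (sketch; assumes indextime on dvobject).
SELECT count(*)
FROM dvobject
WHERE dvobject.dtype = 'Dataset'
  AND dvobject.indextime IS NULL;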

[ ] There are other, less sizeable differences in the by-subject categories, but still noticeable: on the scale of 10-20 for 2 subjects and 1 for most. Will check these a little more closely. Upon closer inspection, these by-subject values on the custom homepage include published dataverses and datasets. When I choose the published facet, the numbers are fairly close.

[ ] Harvested datasets in the last 30 days also appears to remain at 0, though a harvest happened yesterday.

To set cache timeout to 1 minute: curl -X PUT -d 1 http://localhost:8080/api/admin/settings/:MetricsCacheTimeoutMinutes

To clear all metrics values cache: curl -X DELETE http://localhost:8080/api/admin/clearMetricsCache

[ ] db update script needs to be renamed to 4.11

kcondon commented 5 years ago

Here is a list of metrics and their facet counterparts:

                        Custom Homepage Stats   Original Homepage Facets/Stats   Diff

Downloads               3,945,611               3,945,611                        0
Dataverses              2,843                   2,842                            1

Top
Total Datasets          41,885                  78,879                           36,994
Locally Deposited       25,494                  25,487                           7
Harvested               16,391                  53,392                           37,001

By Subject
Agricultural Sciences   870                     869                              1
Arts and Humanities     602                     601                              1
Astronomy and Human     451                     445                              6
Business and Mgt        285                     285                              0
Chemistry               112                     111                              1
Comp Science            599                     598                              1
Earth Science           1,067                   1,066                            1
Engineering             225                     244                              19
Law                     194                     184                              10
Math Science            132                     131                              1
Medicine Science        1,984                   1,983                            1
Physics                 110                     110                              0
Social Science          13,242                  13,240                           2

matthew-a-dunlap commented 5 years ago

I've made headway on the metrics issue with harvested datasets. There was an issue with the group-by subquery, which is now fixed, but that revealed another issue under the hood.

It looks like many of the released harvested datasets do not have a releasetime (28375 of 53733). This is causing our metric for total datasets to be wildly off, because under the hood we use the same query as the "toMonth" metric, specifying the current month.

I'm not sure how to handle this, as it's not clear-cut like downloads, where all the undated records were historic. These records without a releasetime span the past 2 years up to the present.

This is a query I've been using to view the data:

SELECT * FROM datasetversion 
join dataset on dataset.id = datasetversion.dataset_id
where releasetime is null
and versionstate='RELEASED' 
-- and dataset.harvestingclient_id IS NULL --uncomment to see 0 unharvested
order by datasetversion.id DESC

Maybe you have some ideas, @scolapasta? We could return only a current total for harvested datasets, but at that point we might as well pull out the whole query parameter and just have it as a separate API endpoint. I'm keen to find a different option, though.

matthew-a-dunlap commented 5 years ago

Maybe we could use lastUpdateTime if releaseTime doesn't exist for the datasets?
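
For example, something along these lines in the counting query (just a sketch of the fallback idea; '2019-01' stands in for the month parameter):

-- Rough sketch only: fall back to lastupdatetime when releasetime is null,
-- so released harvested versions still count toward the requested month.
SELECT count(DISTINCT datasetversion.dataset_id)
FROM datasetversion
JOIN dataset ON dataset.id = datasetversion.dataset_id
WHERE datasetversion.versionstate = 'RELEASED'
  AND to_char(COALESCE(datasetversion.releasetime, datasetversion.lastupdatetime), 'YYYY-MM') <= '2019-01';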

mheppler commented 5 years ago

@matthew-a-dunlap @landreev Is lastUpdateTime the time stamp that harvesting uses? We're looking for a time stamp of when the record was added to the Harvard Dataverse, right?

matthew-a-dunlap commented 5 years ago

@mheppler All I know for sure is that all the harvested datasets have a lastUpdateTime.

matthew-a-dunlap commented 5 years ago

I'm breaking down the bySubject numbers on dvn-vm5. Looking at just datasets, everything is very close between Solr and the metric (screenshots from 2019-02-05). The few differences could be chalked up to indexing issues. Looking into the dataverses query, because that looks more problematic.

mheppler commented 5 years ago

There was a change to an icon in the dynamic custom homepage HTML which will require an update to the Harvard Dataverse customization files that I have set up for Kevin in Google Drive. Just adding this here as a reminder to myself and a heads-up to @kcondon.

landreev commented 5 years ago

Regarding the harvested datasets: we do NOT populate the publicationdate of harvested datasets. We only fill in the creationdate - and since all harvested datasets are published by definition, it can be assumed to also be the publicationdate. The harvested datasets in the database that happen to have a publicationdate are the legacy ones that were migrated from DVN3.

We can discuss changing this arrangement separately. But for the purposes of this issue, we should simply go ahead and change the dataset-counting queries to work based on this definition, that all the harvested datasets should be counted as published.

So instead of doing "SELECT ... WHERE ... dvobject.publicationdate IS NOT null" we should be doing "SELECT ... WHERE ... (dvobject.publicationdate IS NOT null OR dataset.harvestingclient_id IS NOT null)".
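
Applied to the total dataset count, that condition would look roughly like this (a sketch of the idea, not the exact query in the code):

-- Count all published datasets, treating every harvested dataset as
-- published even though its publicationdate is not populated.
SELECT count(*)
FROM dataset
JOIN dvobject ON dvobject.id = dataset.id
WHERE dvobject.publicationdate IS NOT NULL
   OR dataset.harvestingclient_id IS NOT NULL;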

matthew-a-dunlap commented 5 years ago

@landreev Thanks for investigating this! I'll make the change :)

matthew-a-dunlap commented 5 years ago

I've run into more problems than I thought trying to get all the file/dataset queries to work dynamically for harvested/local. I removed the dataLocation option from all file queries (as we don't use them on the homepage anyway) and from dataset/bySubject. The harvest/local/all queryParam for the other dataset queries seems to work well.

After removing this from dataset/bySubject, I realized that it was a hard requirement for the homepage to get all the results. Talking with @landreev earlier, we agreed that the base query we had used for datasets/files is a bit confusing and should be rewritten, but I had hoped to avoid doing that as part of the homepage story.

We may be able to sidestep this issue somewhat by writing a different/simpler query that gets the subject counts without caring about the timestamp, and having that return harvest/local. But it'll make the metrics API a bit more confusing and is still work.
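
Roughly what I have in mind for that simpler query, with table and column names written from memory, so treat it as a sketch rather than something tested:

-- Sketch: subject counts over published datasets (local + harvested),
-- with no timestamp filter. Join-table/column names are assumptions.
SELECT controlledvocabularyvalue.strvalue, count(DISTINCT dataset.id)
FROM datasetfield_controlledvocabularyvalue
JOIN controlledvocabularyvalue ON controlledvocabularyvalue.id = datasetfield_controlledvocabularyvalue.controlledvocabularyvalues_id
JOIN datasetfield ON datasetfield.id = datasetfield_controlledvocabularyvalue.datasetfield_id
JOIN datasetfieldtype ON datasetfieldtype.id = datasetfield.datasetfieldtype_id
JOIN datasetversion ON datasetversion.id = datasetfield.datasetversion_id
JOIN dataset ON dataset.id = datasetversion.dataset_id
JOIN dvobject ON dvobject.id = dataset.id
WHERE datasetfieldtype.name = 'subject'
  AND datasetversion.versionstate = 'RELEASED'
  AND (dvobject.publicationdate IS NOT NULL OR dataset.harvestingclient_id IS NOT NULL)
GROUP BY controlledvocabularyvalue.strvalue;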

I'm out tomorrow and will be unable to work on this. Feel free to revert my last two commits if needed to work on the bySubject query.

matthew-a-dunlap commented 5 years ago

btw, the approach I was trying was to update this section of bySubject/toMonth:

from datasetversion where datasetversion.dataset_id || ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber) in

removing it so that it matches how the basic toMonth query works now. There may be some problem with this though, as harvested datasets may not have a datasetversion.

landreev commented 5 years ago

I can definitely help with figuring out better queries there. Just to confirm that I'm reading this correctly: the "totals" queries are now working correctly (for local, harvested, and/or both), and the bySubject query is working correctly for local datasets, but not for harvested ones? I'll look into it.

And yes, it looks like the only harvested datasets that have numeric version numbers are the ones harvested from other Dataverses. The ones harvested from generic OAI archives and such don't. Whether this is necessarily a problem, we need to find out; that fragment in the query:

... ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber) ...

may simply become a "0" when the version numbers are missing; and it would still uniquely identify the dataset, in combination with the dataset id.

landreev commented 5 years ago

(and yes, the bySubjectToMonth should be the same query as bySubject - but with the time argument added...)

matthew-a-dunlap commented 5 years ago

@landreev That's correct, the totals look to be working correctly now. Thanks for looking into this.

landreev commented 5 years ago

so yeah, these lines:

datasetversion.dataset_id || ':' || max(datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber))

or

datasetversion.dataset_id || ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber)

both result in empty strings when versionnumber and/or minorversionnumber are null. So count(*) works - it just counts lines, regardless of the content. But "where ... in ..." using this expression only finds the versions with version numbers present.
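
One way to make that expression null-safe would be something like this (just a sketch of the idea):

-- Sketch: coalesce missing version numbers to 0 so the concatenated key is
-- never empty and "where ... in ..." also matches harvested versions.
datasetversion.dataset_id || ':' || (COALESCE(datasetversion.versionnumber, 0) + (.1 * COALESCE(datasetversion.minorversionnumber, 0)))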

(I'm working on a simpler query)

landreev commented 5 years ago

OK, I haven't really made it simpler per se; I'm still relying on the "max(datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber))" gimmick in order to select the latest released version for the local datasets (haven't been able to think of a simpler/cleaner query). But I got it to work with harvested datasets, and I used a simpler query for those, one that relies on the assumption that all the harvested datasets are published and that there's only one version per dataset.
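
Structurally, the counting part ends up looking roughly like this (a simplified sketch; the real bySubject query still needs the max(...) expression to read the subject values off the latest released version):

-- Simplified sketch of the combined approach: latest released version per
-- local dataset, plus all harvested datasets (assumed published, one version each).
SELECT count(*) FROM (
    SELECT datasetversion.dataset_id
    FROM datasetversion
    JOIN dataset ON dataset.id = datasetversion.dataset_id
    WHERE dataset.harvestingclient_id IS NULL
      AND datasetversion.versionstate = 'RELEASED'
    GROUP BY datasetversion.dataset_id
  UNION ALL
    SELECT dataset.id
    FROM dataset
    WHERE dataset.harvestingclient_id IS NOT NULL
) AS published_datasets;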

(I've only modified the datasets/bySubjectToMonth query; if any other similar queries in there need to be able to select either local, or harvested, or both, they need to be similarly modified.)