Closed by mheppler 5 years ago
Related to the "Activity download count being off" to-do list item: #4970
For the "Activity" download counts problem, a short-term fix could be to just remove that section of the html until we get the metrics to line up in a future release
Should Search input watermark dataset count be 27.4k (# of datasets added?) or 81.2k (total number including harvested)? I'd vote for the latter
There's some more feedback coming from @mercecrosas for this issue. @TaniaSchlatter will add it tomorrow morning.
I sent feedback to Tania already
@scolapasta Search input watermark dataset count should be @ 81.2k – total number including harvested.
Wanted to record this Stack Overflow resource for new column CSS properties used in the subject count and recent dataset sections.
To-Do List from most recent design review
Other customization fixes
Added a new `noscript` error alert msg block to the header and bundle. This was discussed with the development team, and it was decided that it best belongs in the site-wide dataverse_header.xhtml, which the template includes on every page. The "error" styling was also required because some important features, such as file download or any feature behind a button dropdown menu, do not work or are inaccessible without JavaScript.
Moved this issue, and its sibling issue Homepage Count Updates #5447, into Code Review along with the PR #5475.
As outlined above, there are still outstanding customization and curation to-do items that will be completed outside of this issue. Those will be coordinated with @kcondon as part of the procedure for moving the dynamic custom homepage back to production.
Feedback from review:
[x] remove the dataset thumbnail images
[x] integrate the other customization fixes:
- Header: add 2px above and below Harvard logo (or make the logo slightly smaller)
- Header & footer: change both to solid background #ececec
[x] Keep all lines of a dataset together – don't break dataset lines across columns (see image)
Noticed a typo in the upper right of the page:
"A dataverse is container for all..."
should be:
"A dataverse is a container for all..."
Fixed revisions requested above, except for the responsive behavior. Hoping to learn more about expected behavior with production/dynamic data.
Updated the Harvard Dataverse Customization documentation in Google Drive. Reviewing those resources with @kcondon to config on the test server.
Got an approval today on the layout revisions. Passing to QA.
[ ] Quick check of numbers shows some significant differences:
- non-harvested: custom homepage: 25,494; orig homepage facet: 28,329; diff: 2,835
- harvested: custom homepage: 16,391; orig homepage facet: 53,393; diff: 37,002
This was tested after a clean reindex, not logged in.
For reference, the custom homepage reports published dataset counts, including harvested, from a db query.
The original homepage facets report all datasets viewable by the currently logged-in user at whichever dataverse that user is viewing, and count a published and a draft version as 2. So, a good comparison is a not-logged-in user at the root dataverse.
There are possibly some marginal differences due to caching, failure to index some datasets, and failure to expunge some deleted datasets from the index, but these should be relatively small and partially corrected by clearing the cache and doing a clean reindex. There are 9 datasets that failed to index, based on indextime being null in this test db.
[ ] There are other, smaller but still noticeable differences in the by-subject categories, on the scale of 10-20 for 2 subjects and 1 for most. Will check these a little more closely. Upon closer inspection, these by-subject values on the custom homepage include published dataverses as well as datasets; when I choose the published facet, the numbers are fairly close.
[ ] Harvested datasets in last 30 days also appears to remain at 0, though a harvest happened yesterday.
To set the cache timeout to 1 minute:

```
curl -X PUT -d 1 http://localhost:8080/api/admin/settings/:MetricsCacheTimeoutMinutes
```

To clear all cached metrics values:

```
curl -X DELETE http://localhost:8080/api/admin/clearMetricsCache
```
[ ] db update script needs to be renamed to 4.11
Here is a list of metrics and their facet counterparts:
| Metric | Custom Homepage Stats | Original Homepage Facets/Stats | Diff |
|---|---|---|---|
| Downloads | 3,945,611 | 3,945,611 | 0 |
| Dataverses | 2,843 | 2,842 | 1 |
| **Top** | | | |
| Total Datasets | 41,885 | 78,879 | 36,994 |
| Locally Deposited | 25,494 | 25,487 | 7 |
| Harvested | 16,391 | 53,392 | 37,001 |
| **By Subject** | | | |
| Agricultural Sciences | 870 | 869 | 1 |
| Arts and Humanities | 602 | 601 | 1 |
| Astronomy and Astrophysics | 451 | 445 | 6 |
| Business and Mgt | 285 | 285 | 0 |
| Chemistry | 112 | 111 | 1 |
| Comp Science | 599 | 598 | 1 |
| Earth Science | 1,067 | 1,066 | 1 |
| Engineering | 225 | 244 | 19 |
| Law | 194 | 184 | 10 |
| Math Science | 132 | 131 | 1 |
| Medicine Science | 1,984 | 1,983 | 1 |
| Physics | 110 | 110 | 0 |
| Social Science | 13,242 | 13,240 | 2 |
I've made headway on the metrics issue with harvested datasets. There was an issue with the `group by` subquery, which is fixed, but that revealed another issue under the hood.
It looks like many of the released harvested datasets do not have a `releasetime` (28,375 of 53,733). This is causing our metrics for total datasets to be wildly off, since under the hood we use the same query as the "toMonth" metric, specifying the current month.
I'm not sure how to handle this, as it's not clear-cut like downloads, where all the undated records were historic. These records without a `releasetime` span the past 2 years up until the present.
This is a query I've been using to view the data:
```sql
SELECT * FROM datasetversion
JOIN dataset ON dataset.id = datasetversion.dataset_id
WHERE releasetime IS NULL
  AND versionstate = 'RELEASED'
  -- AND dataset.harvestingclient_id IS NULL -- uncomment to see 0 unharvested
ORDER BY datasetversion.id DESC
```
Maybe you have some ideas @scolapasta ? We could only return a current total for harvested datasets, but at that point we might as well pull out the whole query parameter and just have it as a separate api endpoint. I'm keen to find a different option though.
Maybe we could use lastUpdateTime if releaseTime doesn't exist for the datasets?
@matthew-a-dunlap @landreev Is the `lastUpdateTime` the timestamp that harvesting uses? We're looking for a timestamp of when the record was added to the Harvard Dataverse, right?
@mheppler All I know for sure is that all the harvested datasets have a lastUpdateTime.
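If `lastUpdateTime` does turn out to be an acceptable stand-in, the fallback could be expressed with COALESCE. A minimal sketch against the tables from the diagnostic query above (the cutoff date is a placeholder, and treating `lastupdatetime` as a proxy for the publication date is exactly the assumption under discussion, not an agreed fix):

```sql
-- Sketch: count released dataset versions, falling back to lastupdatetime
-- when releasetime is missing (assumption: lastupdatetime approximates
-- when the record was added).
SELECT count(*)
FROM datasetversion
JOIN dataset ON dataset.id = datasetversion.dataset_id
WHERE versionstate = 'RELEASED'
  AND COALESCE(datasetversion.releasetime,
               datasetversion.lastupdatetime) < '2019-02-01';
```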
I'm breaking down the bySubject numbers on dvn-vm5. Looking at just datasets everything is very close. Solr: Metric: The few differences could be chalked up to indexing issues. Looking into the dataverses query because that looks more problematic.
There was a change to an icon in the dynamic custom hmpg HTML which will require an update to the Harvard Dataverse customization files that I have set up for Kevin in Google Drive. Just adding this here as a reminder to myself and a heads up to @kcondon.
Regarding the harvested datasets: We do NOT populate the publicationdate of harvested datasets. We only fill the creationdate - and since all the harvested datasets are published by definition, it can be assumed to also be the publicationdate. The harvested datasets in the database that happen to have the publicationdate are the legacy ones that were migrated from DVN3.
We can discuss changing this arrangement separately. But for the purposes of this issue, we should simply go ahead and change the dataset-counting queries to work based on this definition, that all the harvested datasets should be counted as published.
So instead of doing `SELECT ... WHERE ... dvobject.publicationdate IS NOT null` we should be doing `SELECT ... WHERE ... (dvobject.publicationdate IS NOT null OR dataset.harvestingclient_id IS NOT null)`
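Spelled out as a full statement, the proposed change could look like the sketch below. The `dtype` filter and the join are my assumptions about the dvobject schema, added so the fragment is self-contained; the substance is the suggested OR condition:

```sql
-- Count datasets that are either locally published or harvested
-- (per the comment above, harvested datasets are published by definition).
SELECT count(*)
FROM dvobject
JOIN dataset ON dataset.id = dvobject.id
WHERE dvobject.dtype = 'Dataset'
  AND (dvobject.publicationdate IS NOT NULL
       OR dataset.harvestingclient_id IS NOT NULL);
```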
@landreev Thanks for investigating this! I'll make the change :)
I've run into more problems than I thought trying to get all the file/dataset queries to work dynamically for harvested/local. I removed the dataLocation option from all files queries (as we don't use them on the homepage anyway) and from dataset/bySubject. The harvest/local/all queryParam for the other dataset queries seems to work well.
After removing this from dataset/bySubject I realized that it was a hard requirement for the homepage to get all the results. Talking with @landreev earlier, we agreed that the base query that we had used for datasets/files is a bit confusing and should be rewritten, but I had hoped to avoid doing that as part of the homepage story.
We may be able to sidestep this issue somewhat by writing a different/simpler query that gets the subject counts without caring about the timestamp, and having that return harvest/local. But it'll make the metrics api a bit more confusing and is still work.
I'm out tomorrow and will be unable to work on this. Feel free to revert my last two commits if needed to work on the bySubject query.
btw, the approach I was trying was to update this section of bySubject/toMonth:
```sql
from datasetversion where datasetversion.dataset_id || ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber) in
```
removing it so it matches how the basic `toMonth` query is now. There may be some problem with this, though, as harvested datasets may not have a `datasetversion`.
I can definitely help figuring out better queries there. Just to confirm that I'm reading this correctly - the "totals" queries are now working correctly (for local, harvested and/or both); and the bySubject query is working correctly for local datasets, but not for harvested ones - ? - I'll look into it.
And yes, it looks like the only harvested datasets that have numeric version numbers are the ones harvested from other Dataverses. The ones harvested from generic OAI archives and such don't. Whether this is a problem necessarily - we need to find out; that fragment in the query:
```sql
... ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber) ...
```
may simply become a "0" when the version numbers are missing; and it would still uniquely identify the dataset, in combination with the dataset id.
(and yes, the bySubjectToMonth should be the same query as bySubject - but with the time argument added...)
@landreev that's correct, the totals look to be working correctly now. Thanks for looking into this.
so yeah, these lines:
```sql
datasetversion.dataset_id || ':' || max(datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber))
```
or
```sql
datasetversion.dataset_id || ':' || datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber)
```
both result in empty strings when versionnumber and/or minorversionnumber are null. so count(*) works - it just counts lines, regardless of the content. But "where ... in ..." using this expression only finds the versions with the version numbers present.
(I'm working on a simpler query)
OK, I haven't really made it simpler per se; I'm still relying on the "max(datasetversion.versionnumber + (.1 * datasetversion.minorversionnumber))" gimmick in order to select the latest released version, for the local datasets (haven't been able to think of a simpler/cleaner query). But I got it to work with harvested datasets, and I used a simpler query for those - that relies on the assumption that all the harvested datasets are published, and that there's only one version per dataset.
(I've only modified the datasets/bySubjectToMonth query; if any other similar queries in there need to be able to select either local, or harvested, or both, they need to be similarly modified.)
Misc HTML + CSS + layout improvements
Javascript fixes
Other customization fixes
Homepage template fixes
Additional curation efforts
Related GitHub Issues
Updated Activity section
Misc notes...