NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

[Parser Fix]: OMICS-DI outbound links failing #123

Closed gtsueng closed 1 month ago

gtsueng commented 4 months ago

Issue Name

OMICS-DI outbound links failing

Issue Description

It looks like OMICS-DI may have changed their url structure. The parser needs to be updated so that the outbound urls work. The NIAID team estimates that 70% of the urls fail.

Issue Example

Example: For this record: https://data.niaid.nih.gov/resources?id=E-GEOD-33531

The outbound link which fails is: https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33531

The working url for this record: https://www.omicsdi.org/dataset/biostudies-arrayexpress/E-GEOD-33531
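A minimal sketch of the kind of rewrite the example implies, assuming the only change is the repository segment of the path. The function name `fix_omicsdi_url` and the mapping table are hypothetical; the single mapping shown is the only one confirmed by the example in this issue.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping table. Only this one pair is confirmed by the
# example in this issue; other repositories may need their own entries.
PATH_SEGMENT_FIXES = {"arrayexpress-repository": "biostudies-arrayexpress"}


def fix_omicsdi_url(url: str) -> str:
    """Rewrite the repository segment of an OMICS-DI dataset url if it is known to have changed."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Assumed path shape: /dataset/<repository>/<accession>
    if len(segments) >= 4 and segments[1] == "dataset":
        segments[2] = PATH_SEGMENT_FIXES.get(segments[2], segments[2])
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    print(fix_omicsdi_url(
        "https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33531"
    ))
```

Urls whose repository segment is not in the table are returned unchanged, so the function is safe to run over every record.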

Related WBS task

For internal use only. Assignee, please select the status of this issue

Status Description

No response

gtsueng commented 4 months ago

@jal347 @DylanWelzel For updating via mongodb, I suggest that you get a frequency table of the base url links and one example for each. We can then manually check how the urls have changed
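The frequency table suggested above could be built along these lines. This is a sketch only: it assumes the urls have already been pulled out of the database into a plain list, and it treats everything before the final path segment as the "base url". The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def base_url(url: str) -> str:
    """Return the url without its last path segment (assumed to be the accession)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    prefix = "/".join(segments[:-1])
    return f"{parts.scheme}://{parts.netloc}/{prefix}"


def base_url_frequency(urls):
    """Return (base url, count, one example url) tuples, most frequent first."""
    counts = Counter()
    examples = {}
    for url in urls:
        key = base_url(url)
        counts[key] += 1
        # Keep the first url seen for each base as the example to check by hand.
        examples.setdefault(key, url)
    return [(key, n, examples[key]) for key, n in counts.most_common()]


if __name__ == "__main__":
    sample = [
        "https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-33531",
        "https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-1",
        "https://www.omicsdi.org/dataset/pride/PXD000001",
    ]
    for base, count, example in base_url_frequency(sample):
        print(count, base, example)
```

Each row gives one example url per base, which is enough to check manually whether that base still resolves.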

gtsueng commented 4 months ago

@rshabman @hartwickma @lisa-mml @sudvenk

Based on @jal347's and @DylanWelzel's investigations, OMICS-DI changed the url structure for datasets related to biostudies, and their sitemap appears to be broken.

To address this issue:

gtsueng commented 4 months ago

According to @DylanWelzel the changes needed within our database are:

Note the distribution.contentUrl will need to be updated as well.

gtsueng commented 4 months ago

As of today, all 3 links for omicsdi (sdPublisher.url) are fixed in our database and are now on prod/staging. We will not be able to do a fresh run/update run until OMICS-DI fixes their sitemap.xml.

Their robots.txt file indicates that they still have a sitemap.xml, which suggests that the disappearance of the file from the site was not intentional; rather, the generation of the file may have been broken by their url re-structuring.
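For reference, the sitemap entries declared in a robots.txt file can be listed with Python's standard library. The sketch below parses robots.txt text that has already been downloaded; the sample content is made up for illustration and is not a copy of the actual OMICS-DI file, and `declared_sitemaps` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return the Sitemap urls declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present (Python 3.8+).
    return parser.site_maps() or []


if __name__ == "__main__":
    # Illustrative content only, not the real file.
    sample = "User-agent: *\nAllow: /\nSitemap: https://www.omicsdi.org/sitemap.xml\n"
    print(declared_sitemaps(sample))
```

A declared sitemap url that does not resolve when requested would match the situation described above.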

gtsueng commented 4 months ago

Per discussion at the bi-weekly meeting dated 2024.02.20, @newgene to reach out to PIs of OMICS-DI if no response is heard from @DylanWelzel's attempt to reach someone by the end of this week.

gtsueng commented 4 months ago

As discussed in an internal meeting on 2024.02.21, the OMICS-DI sitemaps are available again. We will proceed to re-crawl the site to generate a new cache. It is estimated to take 45 days to complete.

rshabman commented 3 months ago

Hi all - Noticed that OMICS-DI outbound links are still failing (all of the links I've tried). Can you provide an update on the status? Since OMICS-DI is a large percentage of the available datasets, do you know when the links will be live? thanks!

gtsueng commented 3 months ago

Hi @rshabman @sudvenk, the corrections were made to the database a while ago and should already be live on production. We are re-crawling the site just to ensure that it matches, but I think the link failures that you're seeing right now are caused by a bigger issue with OMICS-DI itself:

[image]

As you can see, the base url for OMICS-DI is giving an error, so their site may be transiently down. The site up/down checker https://www.isitdownrightnow.com/ suggests that OMICS-DI itself is down right now.

[image]

In our previous discussions of transiently broken links, the following disclaimer was added to the About page to address this issue:

[image]

rshabman commented 3 months ago

Thanks @gtsueng - I should have checked OmicsDI directly. Appreciate the updates, very helpful. No other questions at the moment.

hartwickma commented 3 months ago

Hi @gtsueng - let's add a banner to the site to let users know we are aware of the issue. Can you please include the statement below at the beginning of the message that you linked in your earlier GitHub comment:

"We are currently experiencing a technical issue linking to OmicsDI records. We are aware of the issue and are working on getting it back up and running. [add text from png]"

Is there a current practice to check for the 'outages' at a specific frequency? It seems like it might be good to discuss strategies to catch and address 'down sites' in a weekly meeting.

gtsueng commented 3 months ago

Hi @hartwickma @sudvenk @rshabman @lisa-mml,

The previous discussions for addressing transiently broken links resulted in the disclaimer text being added to the About page, rather than having a banner for the site. For this reason, we are not set up to immediately implement what you've requested.

There are two ways we can address the request to add some sort of notification about the transient outage to the site:

Please let me know how you'd like us to proceed and we will get right to it.

Regarding the question:

"Is there a current practice to check for the 'outages' at a specific frequency? It seems like it might be good to discuss strategies to catch and address 'down sites' in a weekly meeting."

We monitor for outages on pages within our control using UptimeRobot. For pages outside our control, we previously decided against monitoring for broken links; however, that discussion was about checking every link, which was deemed too costly/resource-intensive to implement. Monitoring at the domain level should be less resource-intensive, and tools (UptimeRobot and other paid tools/services) are available, though I've only ever seen UptimeRobot used for monitoring one's own site, not the sites of others.
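To illustrate why domain-level monitoring is cheaper than checking every link, here is a rough sketch that collapses a list of outbound links to their unique domains and probes each one once. It is not how UptimeRobot works and not something we have implemented; the function names, user-agent string, and timeout are all assumptions.

```python
from http.client import HTTPException
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def unique_domains(urls):
    """Collapse outbound links to one scheme://host entry per source site."""
    return sorted({f"{p.scheme}://{p.netloc}" for p in map(urlsplit, urls) if p.netloc})


def is_up(base_url: str, timeout: float = 10.0) -> bool:
    """Probe a base url once; treat server errors and network failures as down."""
    request = Request(base_url, method="HEAD", headers={"User-Agent": "outage-check-sketch"})
    try:
        with urlopen(request, timeout=timeout):
            return True
    except HTTPError as exc:
        # A 4xx response still means the server answered; 5xx counts as down.
        return exc.code < 500
    except (OSError, HTTPException):
        # DNS failure, refused connection, timeout, malformed response, etc.
        return False


def check_domains(urls, probe=is_up):
    """Return {domain: True/False} using one probe per domain, not per link."""
    return {domain: probe(domain) for domain in unique_domains(urls)}
```

The `probe` argument is injectable so the logic can be tried without network access, for example `check_domains(links, probe=lambda d: True)`. Some sites reject HEAD requests, so a real check might need to fall back to GET.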

hartwickma commented 3 months ago

Hi @gtsueng - thank you for the update and potential options. I checked in today and it looks like OmicsDI is back up and running again. Let's add this issue of 'site outage' to an upcoming meeting so we can better understand the pros and cons of monitoring and providing updates to users about 'domain-level' issues.

As a note, it is important to highlight that the GitHub issue referenced here: https://github.com/NIAID-Data-Ecosystem/niaid-feedback/issues/62 is a discussion about how to deal with broken links for individual datasets. This was resolved by adding a section to the 'About' page. The messaging in the 'About' section does not sufficiently address the case where links to an entire repository (or a majority of it) are broken or under construction.

Looking forward to a constructive discussion about strategies and approaches to rapidly identify these issues and provide messaging to the user community.

gtsueng commented 3 months ago

@hartwickma @sudvenk Sounds good. I'll add it to the list of potential agenda items for our next bi-weekly meeting.

gtsueng commented 1 month ago

The changes in the crawler have been implemented and a new build finished successfully. We are marking this issue as pending close-out and will close it in a week if there are no additional comments.