epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Errors with three of the scrapers #36

Closed by jpcorb20 3 months ago

jpcorb20 commented 4 months ago

Hello,

I tried to run the Magic, Drugs and GuidelineCentral scrapers without success, while some of the others worked fine. Any idea how to make them work? The Drugs scraper seemed to run, but the resulting JSONL contained 0 articles. GuidelineCentral ran into click issues. Finally, Magic printed errors for every article but one.

Thanks in advance,

jpcorb20 commented 4 months ago

Looks like the Drugs.com scraper works after changing "class='ContentBox'" to "class='ddc-main-content'". The content variable also needs to be read with "content.get_attribute('innerHTML')" so that markdownify receives an HTML str.
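For anyone hitting the same thing, here is a minimal sketch of that change, not the repository's actual scraper code. The function names (`element_html`, `scrape_drugs_page`) and the `convert` parameter are hypothetical, and I am assuming the element is looked up by class name. `driver` is expected to be a Selenium WebDriver and `convert` a callable such as `markdownify.markdownify`.

```python
# Sketch only: helper names are hypothetical, not taken from the meditron repo.

# Class of the main content container on Drugs.com pages, as reported above
# (the scraper previously looked for 'ContentBox').
MAIN_CONTENT_CLASS = "ddc-main-content"


def element_html(element):
    """Return the inner HTML of a Selenium WebElement as a str.

    markdownify expects an HTML string, not a WebElement, so the element
    has to be read through get_attribute('innerHTML') first.
    """
    html = element.get_attribute("innerHTML")
    return html or ""


def scrape_drugs_page(driver, url, convert):
    """Load `url`, locate the main content container, and convert it.

    `driver` is a Selenium WebDriver; `convert` turns an HTML str into
    Markdown (for example markdownify.markdownify).
    """
    driver.get(url)
    # "class name" is the string value of selenium's By.CLASS_NAME; it is
    # written literally so this sketch has no import-time dependencies.
    content = driver.find_element("class name", MAIN_CONTENT_CLASS)
    return convert(element_html(content))
```

With Selenium and markdownify installed, this would be called as `scrape_drugs_page(driver, url, markdownify.markdownify)`. Passing the converter in as an argument also makes it easy to swap in another HTML-to-Markdown function.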

jpcorb20 commented 4 months ago

For GuidelineCentral, I first had general issues with the Chrome driver on WSL and changed some options. Beyond that, it looks like the "--headless" option must not be enabled for this scraper to work.
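A minimal sketch of making the headless flag optional is below. It is not the repository's driver setup: `chrome_arguments`, `make_driver` and the `extra` parameter are hypothetical names, and the WSL-specific options I changed are not reproduced here, so they would go in through `extra`.

```python
# Sketch only: function names and the `extra` parameter are hypothetical.


def chrome_arguments(headless, extra=()):
    """Build the list of Chrome command-line arguments for a scraper.

    `headless` should be False for GuidelineCentral, which (as observed
    above) does not work when '--headless' is enabled. `extra` carries any
    environment-specific arguments, for example ones needed on WSL.
    """
    args = list(extra)
    if headless:
        args.append("--headless")
    return args


def make_driver(headless=False, extra=()):
    """Create a Chrome WebDriver with the arguments built above."""
    # Imported here so chrome_arguments() can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless, extra):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

The GuidelineCentral scraper would then use `make_driver(headless=False)`, while scrapers that run fine without a visible browser window could keep `headless=True`.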

AGBonnet commented 3 months ago

Hi @jpcorb20,

Thanks a lot for your interest in the guidelines scraping pipeline.

As described in the user notice, these scrapers are very fickle and aren't made to withstand the dynamic nature of websites. They worked in November 2023, but there's no guarantee that they will keep working in the future.

If you'd be interested in updating the scrapers in the pipeline, please open a pull request. We'd be grateful for your help.

jpcorb20 commented 3 months ago

Hello @AGBonnet, thanks for your reply! Definitely, I will open a PR with the updates to two scrapers (Drugs and GuidelineCentral).