UTMediaCAT / mediacat-domain-crawler

Internet domain crawler

Integrate Date Detection into crawler #6

Closed: kstapelfeldt closed this issue 3 years ago

kstapelfeldt commented 3 years ago

Integration of https://www.npmjs.com/package/metascraper

kstapelfeldt commented 3 years ago

Write a function/service that, based on a URL list, will asynchronously retrieve all of the dates.
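As a rough illustration of the request above, here is a minimal sketch of a function that looks up dates for a list of URLs concurrently. It is not the project's code: `getDatesForUrls` and the `extractDate` parameter are hypothetical names, and `extractDate` stands in for whatever actually fetches the page and runs metascraper on it. `Promise.allSettled` is used so that one failing URL does not reject the whole batch.

```javascript
// Hypothetical helper (not the project's getDates.js): look up a date for
// every URL concurrently without letting one failure reject the whole batch.
// `extractDate` is an assumed async function: (url) => date string or null.
async function getDatesForUrls(urls, extractDate) {
  // The async wrapper turns synchronous throws into rejected promises too.
  const settled = await Promise.allSettled(
    urls.map(async (url) => extractDate(url))
  );

  // Promise.allSettled preserves input order, so index i matches urls[i].
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { url: urls[i], date: outcome.value ?? null, error: null }
      : { url: urls[i], date: null, error: String(outcome.reason) }
  );
}
```

In practice `extractDate` would presumably download the HTML and pass `{ html, url }` to a metascraper instance configured with a date rule; that wiring is left out here because it depends on how the crawler fetches pages.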

kstapelfeldt commented 3 years ago

Take all the links from the domain crawler's output JSON, make the calls to retrieve dates and other metadata for each link, and append this data to the existing JSON.

kstapelfeldt commented 3 years ago

Integration of metascraper is not complete. metascraper runs asynchronously, and Alex has tried two different approaches, but both still have a lot of bugs.

Some of the functions are being tested. He is working on two versions, but each has its own issues; metascraper being async complicates things.

It appears that things are timing out.

kstapelfeldt commented 3 years ago

Alex worked on this but is still hitting roadblocks: if metascraper doesn't connect, it hangs. He is still working through the issues and making some progress; a lot of work has been done.

Alex talked to @jacqueline-chan - she will review his work and see if she can work around it.
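One common way to guard against the kind of hang described above is to race each call against a timer. The sketch below is a generic pattern, not what the branch actually does; `withTimeout` and its `label` parameter are hypothetical names. Note that it only stops waiting: the underlying request is not cancelled and may keep running in the background.

```javascript
// Hypothetical wrapper: reject if `promise` has not settled within `ms`.
// This stops the caller from waiting forever; it does not abort the
// underlying work, which would need its own cancellation mechanism.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A date lookup could then be wrapped as `withTimeout(lookup(url), 10000, url)`, where `lookup` is whatever async function performs the fetch, so a dead connection surfaces as an ordinary rejection instead of stalling the run.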

jacqueline-chan commented 3 years ago

Reviewed Alex's work. He was definitely headed in the right direction! I redid his script as getDates.js on the same branch, and it should work now. I will leave it to @AlAndr04 to review, test, and continue from it if need be.

kstapelfeldt commented 3 years ago

Alex has tested @jacqueline-chan's work and confirms that it works! (yay!)

End point: We need to grab all relevant data through metascraper and concatenate it with the JSON that is being produced by Raiyan's crawler. We want 1 JSON file that represents each article, with all the metascraper and puppeteer data included in it.
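The end point above could be sketched roughly as follows. The real shape of the crawler output is not shown in this thread, so this assumes both inputs are arrays of objects that share a `url` field, and it nests the metascraper fields under a `metascraper` key to avoid overwriting crawler fields. The function name and both of those choices are assumptions for illustration only.

```javascript
// Hypothetical merge step. Assumes each crawler article and each
// metascraper result is an object with a `url` field to join on.
function mergeArticleMetadata(crawlerArticles, metascraperResults) {
  // Index metascraper results by URL for constant-time lookup.
  const byUrl = new Map(metascraperResults.map((meta) => [meta.url, meta]));

  // Return new objects rather than mutating the crawler output in place.
  return crawlerArticles.map((article) => ({
    ...article,
    metascraper: byUrl.get(article.url) ?? null,
  }));
}

// Example with made-up data:
const merged = mergeArticleMetadata(
  [{ url: 'https://example.com/a', links: [] }],
  [{ url: 'https://example.com/a', date: '2020-05-01T00:00:00.000Z' }]
);
console.log(JSON.stringify(merged, null, 2));
```

Articles with no matching metascraper result get `metascraper: null`, which keeps the gap visible in the combined file instead of silently dropping the article.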

jacqueline-chan commented 3 years ago

The JSON is now concatenated and looks like the example in this README file: https://github.com/UTMediaCAT/mediacat-domain-crawler/blob/Metascraper/newCrawler/README.md

These changes are on the Metascraper branch.