DeepBlueCLtd / LegacyMan

Legacy content for Field Service Manual
https://deepbluecltd.github.io/LegacyMan/index.html
Apache License 2.0

Generic crawler using Beautiful Soup #26

Closed · rnllv closed this 1 year ago

rnllv commented 1 year ago

Implement a generic and simple crawler using Beautiful Soup that can be extended to meet specific parsing requirements under this project.

This crawler should handle
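A generic, extensible crawler of the kind described above might be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names (`extract_links`, `crawl`) and the `parse_page` hook are hypothetical, chosen to show how project-specific parsing could be plugged in.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the set of absolute URLs referenced by <a href=...> in html."""
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}


def crawl(start_url, parse_page=None):
    """Breadth-first crawl from start_url, visiting each page once.

    parse_page(url, html) is an optional hook, so the crawler can be
    extended to meet specific parsing requirements.
    """
    visited = set()
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url).text
        if parse_page is not None:
            parse_page(url, html)
        for link in extract_links(html, url):
            if link not in visited:
                queue.append(link)
    return visited
```

Relative hrefs are resolved against the page they appear in via `urljoin`, which matches the LegacyMan content where all references are relative.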

rnllv commented 1 year ago

Refer this for basic scraping and extraction https://realpython.com/beautiful-soup-web-scraper-python/

rnllv commented 1 year ago

It's assumed that none of the HTML pages will contain an absolute file path (a path starting with /) as the value of an href.

@IanMayo, I'll provide a test file for this assumption that you'll need to execute on the client machine to validate.
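The assumption above can be checked mechanically. A minimal sketch (the function name `absolute_hrefs` is hypothetical, not the test file mentioned above):

```python
from bs4 import BeautifulSoup


def absolute_hrefs(html):
    """Return every href value that starts with '/'.

    Under the stated assumption, this list should be empty for all
    pages in the content.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [tag["href"] for tag in soup.find_all(href=True)
            if tag["href"].startswith("/")]
```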

IanMayo commented 1 year ago

Correct - all file references are relative. The content has been moved to different folder structures several times, and still works perfectly.

Aah, so this is another "quality control" test - that all references are relative. This is just in case there is a historic bug in HTML.

rnllv commented 1 year ago

Our crawler has identified the following invalid hrefs. We'll be pushing a draft PR soon.

$ python3 -m legacyman_parser.quality_checks_broken_url

Reference https://deepbluecltd.github.io/LegacyMan/data/Trumpton/Trumpton1.html not found in https://deepbluecltd.github.io/LegacyMan/data/PlatformData/Europe.html
Reference https://deepbluecltd.github.io/LegacyMan/data/Belgium1/Belgium1.html not found in https://deepbluecltd.github.io/LegacyMan/data/PlatformData/Europe.html
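A broken-reference check like the one producing the output above could be sketched as follows. This is an illustration only: `broken_references` and its `pages` mapping are hypothetical, and fragments are stripped before comparison so `page.html#anchor` resolves to `page.html`.

```python
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup


def broken_references(pages):
    """pages maps each crawled URL to its HTML.

    Returns (target, source) pairs for every href whose resolved target
    (fragment removed) is not among the known pages.
    """
    known = set(pages)
    problems = []
    for source, html in pages.items():
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            target, _fragment = urldefrag(urljoin(source, a["href"]))
            if target not in known:
                problems.append((target, source))
    return problems
```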
rnllv commented 1 year ago

The crawler has identified duplicate references, i.e. the same hrefs accessed from different HTML files.

-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number4 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/France_Composites/unit_a.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number3 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/France_Composites/unit_a.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/France/France1.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/Narnia/Narnia1.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number4 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/Britain_Standalones/britain_1a.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number3 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/Britain_Standalones/britain_1a.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number4 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/Britain_Composite/unit_ab.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
-
Duplicate reference https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/France_Legacy_Pics.html#number3 detected and avoided from https://deepbluecltd.github.io/LegacyMan/data/Britain_Composite/unit_ab.html
Reference already accessed from https://deepbluecltd.github.io/LegacyMan/data/France_Legacy/unit_d.html
--
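The duplicate detection shown in the log above amounts to remembering where each href was first seen. A minimal sketch (the function name `report_duplicates` is hypothetical; note that hrefs are compared including their `#fragment`, as in the log):

```python
def report_duplicates(references):
    """references is an iterable of (href, source_page) pairs in crawl order.

    Returns (href, duplicate_source, first_source) for each href seen
    again from a different page, so it can be skipped rather than
    crawled twice.
    """
    first_seen = {}
    duplicates = []
    for href, source in references:
        if href in first_seen:
            duplicates.append((href, source, first_seen[href]))
        else:
            first_seen[href] = source
    return duplicates
```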
IanMayo commented 1 year ago

Oh, I don't think duplicate references are an issue. There are lots of instances of a document being the destination of many links.

rnllv commented 1 year ago

Got it.

I'm not flagging these as issues. However, we'll need our parser to identify already-visited pages to ensure

IanMayo commented 1 year ago

Aah, ok. Within the "walking" strategy we won't be crawling the pages that receive lots of links (such as the Abbreviations page). I have seen instances where the same class description is called from multiple countries (since they all use it). Yes, we shouldn't parse it multiple times, but should instead link to the other instance.

I guess we'll capture this in our data model. For a country, it will contain a list of either:

Hmm, but for JSON, I guess we should duplicate the data.

rnllv commented 1 year ago

> Hmm, but for JSON, I guess we should duplicate the data.

Ahh yes. We'll need to factor this as well.
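One way to factor this, as a hypothetical sketch (the names `class_descriptions`, `countries`, and `to_json` are illustrative, not the project's data model): parse each shared class description once and hold it by id, then duplicate the data per country only when emitting JSON.

```python
import json

# Shared class descriptions are parsed once and stored by id.
class_descriptions = {
    "desc-1": {"name": "Shared class", "details": "parsed once"},
}

# Each country refers to shared descriptions by id, not by copy.
countries = [
    {"country": "France", "classes": ["desc-1"]},
    {"country": "Britain", "classes": ["desc-1"]},
]


def to_json(countries, class_descriptions):
    """Expand shared ids into per-country copies for the JSON output."""
    expanded = [
        {
            "country": c["country"],
            "classes": [class_descriptions[i] for i in c["classes"]],
        }
        for c in countries
    ]
    return json.dumps(expanded, indent=2)
```

The in-memory model stays deduplicated, while every country's JSON record carries its own copy of the shared data.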