ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Project Evolution #192

Open acrois opened 2 years ago

acrois commented 2 years ago

Hello, I would like to propose a plan for the future of this project.

Purpose:

Give the community of developers and users a common structure for collaboration, and improve the quality and speed of archiving the web.

Scary issues out of the way first: we don't want to make any license changes, we want to avoid major (breaking) refactors (requiring end-user changes), and we want to maintain support for existing functionality. New development efforts should focus on performance, stability, scalability, features, and pain points (a little of everything). We want to encourage adoption of the library and improve the user experience of the application. We also want to establish and grow the project's visibility by giving aspiring archivists (known to some as web preservationists) a tool to use, as well as a clear way to contribute to the massive work that needs to be done to preserve the web as we know it.

As a result, I present an outline of the things I would like us to agree on and embark on for this project. I have roughly ordered them by (my perceived) importance to the project (vs. complexity vs. dependency), considering what falls within the aforementioned categories of visibility, stability, performance, and features. This list and its ordering are subject to change as the community prioritization process happens and we take this project to the next level. I would love to have a discussion in the comments if you are curious about any of this. After all, we need consensus to be considered a community. This is just a launch point to get us started.

I truly believe all of this is achievable, but it will require help from the community (you). Without any further feedback, I will personally be pitching in a lot on the code, as well as providing consultancy for PRs, but this project is bigger than me, and I expect we would like to encourage open collaboration on these issues. I am grateful to have the opportunity not only to use this wonderful project, but also to contribute to it. I hope we can evolve it together within the next 3-6 months!

Please review:

Project management
    Discover, define, estimate tasks
        GitHub Projects or other project management/tracking software
    Contribution guidelines
        Application versioning - Semantic versioning
        Git style
            Git flow?
        Code standards
            Linting
        Release creation documentation
    Application Package Roadmap(s)
        Encompassing all features and their releases in a timeline projected over the next few quarters.

Dockerization
    Document usage in README
    Follow best practices (conventions & security)
        Minimal layers
        Parameter pass-through
        User-space isolation
    Docker-compose/Helm/Kubernetes example(s)
    Test-suite harness
        Automated testing
    Daemon to spin up grab-clients?
        ** Do not expose the docker socket!! :D
        Start crawl from dashboard
        Maybe custom resource definition for a grab-client operator in Kubernetes?
    Related:
        #93
        #182
        #149
        #176
        #175
    Browsing, searching the WARC locally
        Visually browse/search WARCs (external project)
        browse @ http host (external project)
        Usage example and documentation

Dashboard improvements
    Re-organize front-end code
        It is somewhat bloated: several files exceed 1k lines, HTML files mix multiple types of content, etc.
    Include attempted crawls, connection errors, etc. (re: #93), and optionally queued URLs
        Queued URL logging would be less costly
            provided we cap the buffer so it cannot consume all the memory available to the browser
    Ability to manage ignore sets and ignore rules while crawling
        Potentially modify any option
        Related: #3
    No log mode
        Display aggregate crawl stats only
    Authentication provider / access control
    TLS usage example
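The queued-URL cap mentioned above could be as simple as a bounded buffer. A minimal sketch in Python (the class and parameter names are hypothetical, not part of grab-site's code):

```python
from collections import deque

class QueuedUrlLog:
    """Keep only the most recent queued URLs so the dashboard's
    memory use stays bounded (illustrative helper, not grab-site API)."""

    def __init__(self, max_entries=1000):
        # A deque with maxlen silently discards the oldest entry
        # once the cap is reached, so memory use cannot grow unbounded.
        self._entries = deque(maxlen=max_entries)

    def add(self, url):
        self._entries.append(url)

    def snapshot(self):
        # Return a copy for rendering, oldest first
        return list(self._entries)
```

The same idea applies on the dashboard's JavaScript side: drop the oldest entries instead of accumulating everything the server sends.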

Server improvements
    Generalize log exporting layer
        Prometheus metric format exporter
            Allows usage by common libs and integration of system reporting services
        Globally & per-crawl
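For the Prometheus exporter, a minimal sketch of rendering per-crawl counters in the Prometheus text exposition format (the metric and field names here are illustrative assumptions, not grab-site's actual stats schema):

```python
def render_prometheus_metrics(crawl_stats):
    """Render per-crawl stats in the Prometheus text exposition format.

    crawl_stats maps a crawl ident to a dict of counter values; the
    "urls_downloaded" field is a made-up example, not grab-site's schema.
    """
    lines = [
        "# HELP grabsite_urls_downloaded_total URLs downloaded per crawl.",
        "# TYPE grabsite_urls_downloaded_total counter",
    ]
    for ident, stats in sorted(crawl_stats.items()):
        # One sample per crawl, labelled by the crawl ident
        lines.append(
            'grabsite_urls_downloaded_total{crawl="%s"} %d'
            % (ident, stats["urls_downloaded"])
        )
    return "\n".join(lines) + "\n"
```

Serving this text at a `/metrics` endpoint (or using the `prometheus_client` library instead of hand-rolling the format) would let common monitoring stacks scrape both global and per-crawl numbers.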

Client improvements
    Investigate and address random crawl hang issue
        May be able to improve the ratio of connections to requests per second, and archival throughput
        GC, database, expensive function calls; still need to do a performance analysis on the app while crawling
        Related: #60
    More application-specific ignore set defaults to choose from
        Review to ensure top platforms are up to date
        Better support for forums (vBulletin, IPB, XenForo)
            Related: #178
            The defaults are good but not 100% for some versions of this software
                (vB tab pages, print views, SMF sorting, etc.)
                This has made the difference between a 500k crawl and a 5M crawl for me
    Ability to resume crawl
        Using ID
        Once implemented, must define default behavior for when the directory already exists
        Resync (recrawl/reindex)
            Delta WARC?
        Related:
            #57
            #58
            #185
    Dead URL / Dupe spotter false positives
        Optimize to avoid dead URLs
        Related: #43
    Windows CRLF/LF/CR adaptability
        Related: #48
    Detect when crawls have been limited, back off exponentially until crawl can resume
        Max retries, automatically adjust rate limiting until "sweet spot" is achieved where blocking does not occur
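The exponential back-off idea above could look something like the sketch below (the function and its parameters are hypothetical; the actual trigger conditions, retry limits, and "sweet spot" adjustment would live in the crawler loop):

```python
import random

def next_delay(attempt, base=1.0, cap=300.0, jitter=0.1):
    """Exponential back-off delay (in seconds) for retrying after the
    remote site appears to be rate-limiting us.

    attempt: 0-based count of consecutive failed/limited attempts.
    base:    delay for the first retry.
    cap:     ceiling so the delay never grows unbounded.
    jitter:  +/- fraction of randomness so many workers retrying the
             same host do not fall into lockstep.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 + random.uniform(-jitter, jitter))
```

On success the attempt counter would reset (or decay), letting the crawl speed back up until the host starts limiting again.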

Documentation improvements
    Document resumption of a grab (specifically the ID field, "job_data.ident")
        Including more explicit docs on the STOP and START process signal options (already supported), plus a code example
        Different failure scenarios, also: when to just... start over!
    Some copy or quote to inspire people to become archivists
        Link to the ArchiveTeam wiki
    Systems documentation
        Internal concepts, implementation specific details, etc.
        Directory structure / descriptions of operational data files
        More in-depth documentation on gs-server and the role of it in the application architecture
            Functional dependencies between grab-site and gs-server, running grab-site standalone?
        Document parameters and live config update conventions
            Which parameters can be updated live, limitations, etc.
    Management
        Background Processing / Daemonization
        Scaling
        Logging
        Storage Management
        Resource Monitoring (IO, CPU, RAM, HDD)
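As a starting point for the live-config docs: grab-site's README already describes editing the `ignores` file in a crawl's directory while the crawl is running. A small helper sketch (the function itself is hypothetical, and it assumes the `ignores` file lives at the top of the crawl directory):

```python
from pathlib import Path

def add_ignore_pattern(crawl_dir, pattern):
    """Append a regexp to the crawl directory's `ignores` file.

    grab-site re-reads this file during a crawl, so (per the README)
    new patterns should take effect without restarting the job.
    """
    ignores = Path(crawl_dir) / "ignores"
    # Append rather than overwrite, one pattern per line
    with ignores.open("a", encoding="utf-8") as f:
        f.write(pattern + "\n")
```

The docs could enumerate exactly which files (`ignores`, `igsets`, concurrency settings, etc.) are watched live and which require a restart.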

Application packaging
    Deployment
        GitHub packages (docker, python)
        Docker Hub (docker)

Project website
    Statically generated website for landing, docs, etc.
    GitHub actions & GitHub sites?
TheTechRobo commented 2 years ago

I agree with most of what you said, but I don't like this:

no major (breaking) refactors (requiring end-user changes),

IMO while we should avoid them, there will be times when it is necessary to do it.

acrois commented 2 years ago

@TheTechRobo Thanks for the feedback

I totally agree with you. You have to sometimes. Sweeping statements/ideas like "don't make breaking changes" are never universally true or achievable. What I meant by that is if we build our next milestones in 1-2 quarters based on just the list, I feel strongly that it should be achievable without breaking things at all. I've revised the wording in the issue a bit to try and reflect that.

In the case of grab-site, I think getting a stable 2.x and then thinking about breaking things in a 3.x release version with all that stuff (and potentially upgraded/rewritten features) would be less volatile. I suppose it all just depends on the nature of change.

On that subject: Maybe there are some other features that people want/need that might/will break things. What do you think would be the best way to go about discovering and including those issues in creating a more concrete plan for this project?

TheTechRobo commented 2 years ago

What do you think would be the best way to go about discovering and including those issues in creating a more concrete plan for this project?

Probably create a GitHub project board or milestone for "Look at later; requires refactor" or something.

TomLucidor commented 7 months ago

Hope things turn out a bit better in the future