deoxykev closed this 10 months ago
Interesting idea, but I'm not sure whether this is overkill. Also, the crawl data probably won't include CSS and JavaScript, which may contain linked assets too.
Actually, it's probably better to do atomic, focused unit testing. There are so many edge cases.
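To make "atomic, focused" concrete, here is a rough sketch of what one such test file could look like. It is only an illustration: `extract_asset_urls` is a hypothetical stand-in written for this example, not the project's real function, and the choice of tags it looks at (`img`, `script`, `link`) is an assumption. Each test pins down exactly one edge case.

```python
import unittest
from html.parser import HTMLParser


class _AssetCollector(HTMLParser):
    # Hypothetical mapping: which attribute carries the asset URL for each tag.
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        for name, value in attrs:
            # Skip tags we don't track and attributes with empty or missing values.
            if name == wanted and value:
                self.urls.append(value)


def extract_asset_urls(html):
    """Hypothetical helper: return asset URLs referenced by an HTML string."""
    collector = _AssetCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


class ExtractAssetUrlsTest(unittest.TestCase):
    def test_img_src(self):
        self.assertEqual(extract_asset_urls('<img src="a.png">'), ["a.png"])

    def test_self_closing_tag(self):
        self.assertEqual(extract_asset_urls('<img src="a.png"/>'), ["a.png"])

    def test_stylesheet_link(self):
        html = '<link rel="stylesheet" href="site.css">'
        self.assertEqual(extract_asset_urls(html), ["site.css"])

    def test_anchor_is_not_an_asset(self):
        self.assertEqual(extract_asset_urls('<a href="/page">x</a>'), [])

    def test_empty_src_is_skipped(self):
        self.assertEqual(extract_asset_urls('<script src=""></script>'), [])


if __name__ == "__main__":
    unittest.main()
```

Small cases like these are cheap to add whenever a new edge case turns up, without needing a large dataset to reproduce it.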
Eventually, we’ll want to test on realistic data, both for benchmarking and for finding edge cases in the code. I’m thinking we could use a WARC player and the Common Crawl C4 dataset (the realnewslike variant), which is about 34 GB.
https://github.com/allenai/allennlp/discussions/5056