Closed LDiazN closed 3 years ago
@LDiazN great work!! Also, thanks a lot for the detailed PR description.
It looks like a test failed in GitHub Actions:
ERROR tests/data/test_tweet_loader.py
I see, the error itself is this one:
ImportError while importing test module '/home/runner/work/c4v-py/c4v-py/tests/data/test_tweet_loader.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py:64: in <module>
from tensorflow.python._pywrap_tensorflow_internal import *
E ImportError: /home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow10DeviceNameIN5Eigen9GpuDeviceEE5valueE
During handling of the above exception, another exception occurred:
tests/data/test_tweet_loader.py:1: in <module>
from c4v.data.tweet_loader import TweetLoader
src/c4v/data/tweet_loader.py:2: in <module>
import tensorflow as tf
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/__init__.py:41: in <module>
from tensorflow.python.tools import module_util as _module_util
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/__init__.py:40: in <module>
from tensorflow.python.eager import context
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/eager/context.py:35: in <module>
from tensorflow.python import pywrap_tfe
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tfe.py:28: in <module>
from tensorflow.python import pywrap_tensorflow
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py:83: in <module>
raise ImportError(msg)
E ImportError: Traceback (most recent call last):
E File "/home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
E from tensorflow.python._pywrap_tensorflow_internal import *
E ImportError: /home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow10DeviceNameIN5Eigen9GpuDeviceEE5valueE
It seems like a tensorflow dependency is missing or broken, but I don't know what the cause is. I didn't change anything dependency-related. Any hints about what it could be?
Crawler: Discover new urls from known sitemaps
Added base class to create crawlers to retrieve interesting urls
Problem
We need to find a way to discover new urls for known and scrapable sites.
Proposed solution
Since most of our currently supported sites have sitemaps, we can use them to discover new urls. Sitemaps provide useful information for every link, such as:
url itself
So we can use it to filter interesting or new data.
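As a rough illustration of that filtering idea, here is a minimal sketch that reads a sitemap string and keeps only recently modified urls. The `<loc>` and `<lastmod>` elements come from the sitemaps.org protocol; whether this PR filters on `<lastmod>` is an assumption, and `urls_modified_since` and `SAMPLE_SITEMAP` are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/old-post</loc>
    <lastmod>2020-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/new-post</loc>
    <lastmod>2021-06-01T10:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/undated-post</loc>
  </url>
</urlset>"""


def urls_modified_since(sitemap_xml: str, since: date) -> List[str]:
    """Return urls whose <lastmod> date is on or after `since`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc:
            continue
        # Assumption: entries without <lastmod> are kept, since we cannot
        # tell whether they are new. Only the date part (YYYY-MM-DD) is compared.
        if not lastmod or date.fromisoformat(lastmod[:10]) >= since:
            urls.append(loc)
    return urls


recent = urls_modified_since(SAMPLE_SITEMAP, date(2021, 1, 1))
```

Keeping undated entries is a deliberate choice in this sketch: dropping them would silently hide urls from sites that omit `<lastmod>`.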
This first iteration supports:
Solution:
Base class BaseCrawler that handles requesting, filtering, and processing logic. It provides a main function, crawl_urls, that returns a list of scraped urls and optionally receives a callback to process subsets of urls as they're being scraped.
The base class provides methods to parse specific sitemaps; they can be easily overridden:
parse_urls_from_sitemap(self, sitemap : str) -> [str]: parse a list of urls from a sitemap xml formatted string
parse_sitemaps_urls_from_index(self, sitemap_index : str) -> [str]: parse a list of urls from a sitemap's index xml formatted string
To implement a new BaseCrawler subclass, implement the mandatory abstract methods:
check_sitemap_url(url : str) -> bool: checks if the subset of the sitemap located by url is an interesting one
check_page_url(str) -> bool: checks if a url to a post is an interesting one
Sample usage:
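A minimal sketch of what a subclass and a call to crawl_urls might look like. `MiniCrawler` is a simplified stand-in written for this example, not the BaseCrawler from this PR: it reuses the method names described above, but its `crawl_urls` signature, the `fetch` parameter, and the sample data are all assumptions, and it reads sitemap XML from strings instead of making requests.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import Callable, List, Optional

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


class MiniCrawler(ABC):
    """Hypothetical, simplified stand-in for BaseCrawler (no HTTP requests)."""

    def parse_sitemaps_urls_from_index(self, sitemap_index: str) -> List[str]:
        root = ET.fromstring(sitemap_index)
        return [e.text.strip() for e in root.findall("sm:sitemap/sm:loc", NS) if e.text]

    def parse_urls_from_sitemap(self, sitemap: str) -> List[str]:
        root = ET.fromstring(sitemap)
        return [e.text.strip() for e in root.findall("sm:url/sm:loc", NS) if e.text]

    @staticmethod
    @abstractmethod
    def check_sitemap_url(url: str) -> bool:
        """Return True if the sitemap at `url` is worth crawling."""

    @staticmethod
    @abstractmethod
    def check_page_url(url: str) -> bool:
        """Return True if `url` points to an interesting page."""

    def crawl_urls(
        self,
        index_xml: str,
        fetch: Callable[[str], str],  # assumed helper: sitemap url -> xml string
        post_process: Optional[Callable[[List[str]], None]] = None,
    ) -> List[str]:
        found: List[str] = []
        for sitemap_url in self.parse_sitemaps_urls_from_index(index_xml):
            if not self.check_sitemap_url(sitemap_url):
                continue  # skip sitemaps the subclass is not interested in
            pages = self.parse_urls_from_sitemap(fetch(sitemap_url))
            batch = [u for u in pages if self.check_page_url(u)]
            if post_process is not None:
                post_process(batch)  # hand each subset over as soon as it is ready
            found.extend(batch)
        return found


class ExampleCrawler(MiniCrawler):
    """Hypothetical subclass: only post sitemaps and urls under /2021/."""

    @staticmethod
    def check_sitemap_url(url: str) -> bool:
        return "post-sitemap" in url

    @staticmethod
    def check_page_url(url: str) -> bool:
        return "/2021/" in url


# Made-up data standing in for what a site would serve.
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/post-sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/category-sitemap.xml</loc></sitemap>
</sitemapindex>"""

SITEMAPS = {
    "https://example.com/post-sitemap1.xml": """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/2021/06/some-post/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>""",
}

batches: List[List[str]] = []
found = ExampleCrawler().crawl_urls(INDEX, SITEMAPS.__getitem__, batches.append)
```

The callback receives one list per accepted sitemap, so a caller could store or scrape each subset without waiting for the whole crawl to finish.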
Interesting files:
src/c4v/scraper/crawler/crawlers/base_crawler.py: Base class definition
src/c4v/scraper/crawler/crawlers/primicia_crawler.py: Crawler for Primicia
src/c4v/scraper/crawler/crawlers/el_pitazo_crawler.py: Crawler for El Pitazo
Further work: