code-for-venezuela / c4v-py

Luis/crawler #79

Closed LDiazN closed 3 years ago

LDiazN commented 3 years ago

Crawler: Discover new urls from known sitemaps

Added base class to create crawlers to retrieve interesting urls

Problem

We need to find a way to discover new urls for known and scrapable sites.

Proposed solution

Since most of our currently supported sites have sitemaps, we can use them to discover new urls. Sitemaps provide useful information for every link, such as:

This first iteration supports:

Solution:

Sample usage:

from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler

pr = PrimiciaCrawler()

urls = pr.crawl_urls(print) # print resulting urls as they come
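To illustrate the sitemap idea described above, here is a minimal sketch of how urls could be pulled out of a sitemap document. It is not the code from this PR: `urls_from_sitemap`, its `on_url` callback (loosely mirroring the callback passed to `crawl_urls`), and the sample sitemap are hypothetical names made up for the example, and it only handles a plain `<urlset>` document following the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

# XML namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(
    xml_text: str, on_url: Optional[Callable[[str], None]] = None
) -> List[str]:
    """Collect the <loc> value of every <url> entry in a sitemap.

    Hypothetical helper: `on_url`, if given, is called once per url
    as soon as it is found, before the full list is returned.
    """
    root = ET.fromstring(xml_text)
    urls: List[str] = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if not url:
            continue  # skip empty <loc> elements
        urls.append(url)
        if on_url is not None:
            on_url(url)
    return urls


# Made-up sitemap used only for this example
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2021-01-01</lastmod></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>
""".strip()

if __name__ == "__main__":
    urls_from_sitemap(SAMPLE_SITEMAP, print)  # print urls as they come
```

A real crawler would first download the sitemap and would likely also need to follow sitemap index files; both are left out here to keep the sketch self-contained.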

Interesting files:

Further work:

dieko95 commented 3 years ago

@LDiazN great work!! Also, thanks a lot for the detailed PR description 🙌

It looks like a test failed within GitHub Actions:

ERROR tests/data/test_tweet_loader.py

LDiazN commented 3 years ago

I see, the error itself is this one:

ImportError while importing test module '/home/runner/work/c4v-py/c4v-py/tests/data/test_tweet_loader.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py:64: in <module>
    from tensorflow.python._pywrap_tensorflow_internal import *
E   ImportError: /home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow10DeviceNameIN5Eigen9GpuDeviceEE5valueE

During handling of the above exception, another exception occurred:
tests/data/test_tweet_loader.py:1: in <module>
    from c4v.data.tweet_loader import TweetLoader
src/c4v/data/tweet_loader.py:2: in <module>
    import tensorflow as tf
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/__init__.py:41: in <module>
    from tensorflow.python.tools import module_util as _module_util
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/__init__.py:40: in <module>
    from tensorflow.python.eager import context
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/eager/context.py:35: in <module>
    from tensorflow.python import pywrap_tfe
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tfe.py:28: in <module>
    from tensorflow.python import pywrap_tensorflow
.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py:83: in <module>
    raise ImportError(msg)
E   ImportError: Traceback (most recent call last):
E     File "/home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/pywrap_tensorflow.py", line 64, in <module>
E       from tensorflow.python._pywrap_tensorflow_internal import *
E   ImportError: /home/runner/work/c4v-py/c4v-py/.nox/tests-3-8/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so: undefined symbol: _ZN10tensorflow10DeviceNameIN5Eigen9GpuDeviceEE5valueE

It seems like there's a tensorflow dependency missing or something, but I don't know what the cause is 🤔. I didn't even change any dependency-related stuff. Any hint about what it could be?
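One possible first diagnostic step, offered as a sketch rather than a confirmed fix: an undefined symbol inside `_pywrap_tensorflow_internal.so` can come from mismatched or overlapping TensorFlow packages in the same environment, so listing what is actually installed in the nox virtualenv may help. That cause is an assumption and is not confirmed anywhere in this thread. The helper name `installed_packages_matching` is hypothetical; it relies only on the standard library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata
from typing import Dict


def installed_packages_matching(prefix: str) -> Dict[str, str]:
    """Map distribution name -> version for every installed package
    whose name starts with `prefix` (case-insensitive)."""
    found: Dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    # Run with the interpreter of the failing environment
    for name, version in sorted(installed_packages_matching("tensorflow").items()):
        print(f"{name}=={version}")
```

If more than one TensorFlow variant shows up, or the version differs from the one the project pins, that would be a lead worth following; if not, the cause is probably elsewhere.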