NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.62k stars 733 forks

Adding table to database #84

Open TheFifthFreedom opened 9 years ago

TheFifthFreedom commented 9 years ago

This is more of a question than an issue per se: I'm trying to add an additional table to the database at runtime to store some particular results. The table shows up in the output .db file, but no rows are ever committed to it, so it remains desperately blank. Here's what I did:

First, I added a class to database.py:

class SampleTable(Base):
    __tablename__ = 'table'

    id = Column(Integer, primary_key=True)
    sample_text = Column(String)

Then I instantiated that class inside SearchEngineResultsPage's set_values_from_parser method:

for key, value in parser.search_results.items():
    if isinstance(value, list):
        for link in value:
            parsed = urlparse(link['link'])

            # fill missing fields with None to prevent key errors
            for field in ('snippet', 'title', 'visible_link'):
                link.setdefault(field, None)

            l = Link(
                link=link['link'],
                snippet=link['snippet'],
                title=link['title'],
                visible_link=link['visible_link'],
                domain=parsed.netloc,
                rank=link['rank'],
                serp=self,
                link_type=key
            )
            # the new row I expected to be stored
            s = SampleTable(
                sample_text='test'
            )

I'm surprised to see SampleTable stay empty, because the SearchEngineResultsPage object is committed to the SQLAlchemy session in scraping.py's store function. Do you have any idea what I'm doing wrong?

TheFifthFreedom commented 9 years ago

Actually, I figured it out in the end. For those who are curious: Link objects are committed after being instantiated inside SearchEngineResultsPage, despite never being explicitly added to the SQLAlchemy session, because they share a many-to-one relationship with the SearchEngineResultsPage table. Each Link is created with serp=self, so it is saved along with the SERP it points to. The relationship is declared in the Link class attributes:

serp_id = Column(Integer, ForeignKey('serp.id'))
serp = relationship(SearchEngineResultsPage, backref=backref('links', uselist=True))
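
To see this cascade in isolation, here is a minimal sketch. It assumes an in-memory SQLite database and uses stripped-down stand-ins for GoogleScraper's SearchEngineResultsPage and Link models (the real classes have many more columns, and the title column here is only for illustration). Only the SERP is passed to session.add(); the Link is picked up through the relationship.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import backref, relationship, sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class SearchEngineResultsPage(Base):
    # simplified stand-in for the real model
    __tablename__ = 'serp'
    id = Column(Integer, primary_key=True)


class Link(Base):
    # simplified stand-in for the real model
    __tablename__ = 'link'
    id = Column(Integer, primary_key=True)
    title = Column(String)
    serp_id = Column(Integer, ForeignKey('serp.id'))
    serp = relationship(SearchEngineResultsPage, backref=backref('links', uselist=True))


engine = create_engine('sqlite://')  # in-memory database, assumed for the demo
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

serp = SearchEngineResultsPage()
link = Link(title='example', serp=serp)  # also appends link to serp.links

session.add(serp)                # only the SERP is added explicitly
link_pending = link in session   # the link was cascaded into the session
session.commit()

link_count = session.query(Link).count()
```

The Link reaches the session through SQLAlchemy's default save-update cascade on serp.links, which is why the original code never needs to call session.add() on each link.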

Therefore, if you want SampleTable rows to be committed the same way, give SampleTable a relationship with SearchEngineResultsPage (as long as it makes sense in your schema) and pass serp=self when instantiating it. For example, this one-to-one relationship:

serp_id = Column(Integer, ForeignKey('serp.id'))
serp = relationship(SearchEngineResultsPage, backref=backref('tables', uselist=False))
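
Putting the pieces together, the following is a rough sketch of the fix rather than GoogleScraper's actual code. It assumes an in-memory SQLite database, a bare-bones stand-in for SearchEngineResultsPage, and a hypothetical table name sample_table (chosen to avoid the SQL keyword table). It contrasts a SampleTable instance linked to a SERP with one that is left unattached, as in the original question.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import backref, relationship, sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class SearchEngineResultsPage(Base):
    # simplified stand-in for the real model
    __tablename__ = 'serp'
    id = Column(Integer, primary_key=True)


class SampleTable(Base):
    __tablename__ = 'sample_table'  # hypothetical name
    id = Column(Integer, primary_key=True)
    sample_text = Column(String)
    serp_id = Column(Integer, ForeignKey('serp.id'))
    # one-to-one: the backref holds a single object instead of a list
    serp = relationship(SearchEngineResultsPage, backref=backref('tables', uselist=False))


engine = create_engine('sqlite://')  # in-memory database, assumed for the demo
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

serp = SearchEngineResultsPage()
unattached = SampleTable(sample_text='unattached')   # as in the question
linked = SampleTable(sample_text='test', serp=serp)  # as in the fix

session.add(serp)
session.commit()

# only the instance linked to the SERP was stored
stored = [row.sample_text for row in session.query(SampleTable).all()]
```

If a relationship does not fit the schema, calling session.add() on the SampleTable instance directly is the other way to get it committed.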