alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.24k stars, 654 forks

Added metadata field #20

Closed. Narasimha1997 closed this 4 years ago

Narasimha1997 commented 4 years ago

This PR lets users add a metadata dictionary and save/load it. Since the metadata is a generic dict, users are free to add any kind of metadata, for example author, license, or description. This gives the learnt rules an identity, which would be useful for those who publish their work.

  1. Added set_metadata() and get_metadata() to bring in these features.
  2. Changed load() and save() to handle the metadata.
  3. Updated docs reflecting these features.

A metadata field would be useful: we can save any sort of information along with the rules. In future, if you want to add other fields to the saved representation, you can put them in the metadata field without making any major change to the codebase.
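The idea described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the PR: the class name RuleStore, the stack_list attribute, and the "metadata" key in the saved JSON are all hypothetical, and only the method names set_metadata(), get_metadata(), save() and load() come from the description.

```python
import json
import os
import tempfile


class RuleStore:
    """Hypothetical stand-in for a scraper object (not AutoScraper's real class)."""

    def __init__(self):
        self.stack_list = []  # learnt rules; attribute name assumed for illustration
        self._metadata = {}   # free-form metadata dict

    def set_metadata(self, metadata):
        # Only dicts are accepted so that the value can be serialized as a JSON object.
        if not isinstance(metadata, dict):
            raise TypeError("metadata must be a dict")
        self._metadata = dict(metadata)

    def get_metadata(self):
        # Return a copy so callers cannot mutate the stored dict by accident.
        return dict(self._metadata)

    def save(self, file_path):
        # The "metadata" key is an assumed layout for the saved file.
        data = {"stack_list": self.stack_list, "metadata": self._metadata}
        with open(file_path, "w") as f:
            json.dump(data, f)

    def load(self, file_path):
        with open(file_path) as f:
            data = json.load(f)
        self.stack_list = data["stack_list"]
        # Files saved without metadata still load; they get an empty dict.
        self._metadata = data.get("metadata", {})


def roundtrip_demo():
    """Save rules plus metadata to a temporary file and load them back."""
    store = RuleStore()
    store.stack_list = [{"rule": "example"}]
    store.set_metadata({"author": "someone", "license": "MIT"})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rules.json")
        store.save(path)
        restored = RuleStore()
        restored.load(path)
    return restored
```

Using data.get("metadata", {}) in load() is one way to keep files written before the change readable, which is the "no major change to the codebase" property the description is after.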

alirezamika commented 4 years ago

I'm not sure about its usage. Nothing in the code uses the metadata, and we don't know what data will be saved or what it will be used for. The class should store the information it needs, in the form that best suits its use. For now I think this is too general an approach for ideas that are not yet curated and implemented. Once the ideas are fixed and clear, we will probably find a better, more explicit approach for them.

Narasimha1997 commented 4 years ago

Yes, you are right. But right now we can save/load only the learnt rules. If a user wants to include any other miscellaneous information, there is no way to save it. In other words, the saved file has no meaning unless it explains what it is, who created it, why it was created, and what it can do. Most file formats have a feature like this, which contributes to explainability. You can provide this as an add-on, as it won't affect any core features. Users can save any information they wish. For example, some users would like to save the URLs that were used for scraping. Some might like to add a description, etc.

For a user who uses rules created by others, this information would be useful, since it explains what the rules are for.

It's just my opinion. It's your call.

alirezamika commented 4 years ago

I understand your point. But this doesn't have any structure. For example, if everybody uses their own structure for adding an author, a description, etc., how can you use it in a proper way?

Narasimha1997 commented 4 years ago

Got it! So which fields would you like to include for now?

PickNickChock commented 4 years ago

I don't know if this PR will be approved, but just in case: in this line (62), metadata_to_save = metadata if (metadata and metadata != {}) else self._metadata, you can shorten if (metadata and metadata != {}) to if metadata, since an empty dict evaluates to False anyway.
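A quick check of that suggestion, assuming metadata is always either None or a dict. The helper names pick_verbose and pick_short are made up for the comparison; only the conditional expression itself comes from the PR.

```python
def pick_verbose(metadata, default):
    # The original expression from the PR.
    return metadata if (metadata and metadata != {}) else default


def pick_short(metadata, default):
    # The shortened form: None and {} are both falsy in Python.
    return metadata if metadata else default


stored = {"author": "stored"}
candidates = [None, {}, {"author": "new"}]

# Both versions pick the same value for every candidate.
results = [(pick_verbose(m, stored), pick_short(m, stored)) for m in candidates]
assert all(verbose == short for verbose, short in results)
```

For these inputs the expression could be reduced further to metadata or default, which behaves the same way.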

Narasimha1997 commented 4 years ago

In the latest commit I have fixed the basic structure of the metadata. It includes: author, author_email, model_name, description, target_urls, keywords. We can extend this in future if there is a requirement. These are the basic fields that any saved model would expect.
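One way a fixed structure like this could be expressed is a dataclass that rejects unknown keys. This is a sketch, not the commit's actual code: the class name ModelMetadata, the from_dict helper, and the field types (strings, and lists of strings for target_urls and keywords) are assumptions; only the six field names come from the comment above.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import List


@dataclass
class ModelMetadata:
    # Field names are taken from the comment; types and defaults are assumed.
    author: str = ""
    author_email: str = ""
    model_name: str = ""
    description: str = ""
    target_urls: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data):
        """Build metadata from a plain dict, refusing keys outside the schema."""
        allowed = {f.name for f in fields(cls)}
        unknown = set(data) - allowed
        if unknown:
            raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
        return cls(**data)


meta = ModelMetadata.from_dict({"author": "someone", "keywords": ["news"]})
as_plain_dict = asdict(meta)  # JSON-serializable form for saving
```

Rejecting unknown keys keeps every saved file on the same structure, which addresses the earlier objection that free-form metadata cannot be consumed reliably.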

alirezamika commented 4 years ago

As we don't have an actual use for it now, I prefer to postpone it. It is subject to change, and changes will be costly in future because some people may already have used it. For example, think about when we actually implement it because we need it (like in the cool hub idea). We may conclude that it would have been better to store author info as a dict {'email': .., 'name':...} for working with APIs instead of this (it's just an example). But we couldn't change it easily because of backward compatibility. I agree that this info will be needed in future, but it's better to approach it when we have worked it out completely and know what we need and why we need it. :)
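To make the backward-compatibility cost concrete, here is a hypothetical sketch of what a reader would need if the author layout changed after files had already been published. The function normalize_author and both layouts are illustrative only; neither exists in the library.

```python
def normalize_author(metadata):
    """Return author info as {'name': ..., 'email': ...} for either layout."""
    author = metadata.get("author")
    if isinstance(author, dict):
        # Hypothetical newer layout: {'author': {'name': ..., 'email': ...}}
        return {"name": author.get("name", ""), "email": author.get("email", "")}
    # Layout from the PR: flat 'author' and 'author_email' strings.
    return {"name": author or "", "email": metadata.get("author_email", "")}


old_style = {"author": "someone", "author_email": "someone@example.com"}
new_style = {"author": {"name": "someone", "email": "someone@example.com"}}

# Every consumer would have to carry this kind of shim indefinitely.
assert normalize_author(old_style) == normalize_author(new_style)
```

Each such change adds another branch that has to be kept for as long as old files exist, which is the cost being weighed against shipping the field early.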

Narasimha1997 commented 4 years ago

Sure! We can. No issues. Got your point, let's postpone this.