Open adilsoncarvalho opened 8 years ago
I read a few documents this week about the differences between MD5 and SHA1 concerning hash collisions. While MD5 is widely supported, SHA1 is considered stronger against collisions, so we're going to use SHA1 as our default hash whenever we need to generate a unique value.
So, from the above post we can now say this:
While activating the HTTP cache we learned that generating hashes can get tricky. Take a look at Understanding the hashes used as filenames.
It seems like a good idea to be able to quickly check whether a specific page has already been scraped.
In Paraná state we enter a summary page and then go to the real one, which has a random URL, so we can generate the unique identifier from the entry-point URL.
Entry point URL: http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/consultaNFCe?chNFe=41161176189406002412651190000337421101819214&nVersao=100&tpAmb=1&cDest=02236640900&dhEmi=323031362d31312d31305431383a31393a32312d30323a3030&vNF=38.72&vICMS=3.26&digVal=502b7663785154305335316d536b34726f70392f4e7561774578493d&cIdToken=000001&cHashQRCode=CE624A6959CF87362570112D42D9B1F49A82B8EE
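The "was this page already scraped?" check could then be a simple file lookup. This is only a sketch under an assumed cache layout (one file per page, named by the SHA1 of its entry-point URL, stored in a hypothetical `downloads` directory); the project hasn't settled on this yet:

```python
import hashlib
from pathlib import Path

def already_scraped(url: str, cache_dir: str = "downloads") -> bool:
    """Return True if a cached file exists for this entry-point URL.

    Assumes a hypothetical layout where each scraped page is saved
    under the SHA1 hex digest of the URL that led to it.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return (Path(cache_dir) / digest).is_file()
```

Because the entry-point URL is stable (unlike the random URL of the real page), hashing it gives a deterministic key for the lookup.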
I'm not sure which method is better:

MD5: 1d962fbf639ececb4cd41c998d7d7ed6
SHA1: 28d2e48ad54e28c7c378d6ce926dbb86c165aade
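For reference, both digests can be produced with Python's standard hashlib; this is just a sketch of how the two candidates above would be derived from the entry-point URL (I haven't verified it reproduces the exact hex values listed):

```python
import hashlib

# the Paraná entry-point URL from the issue
url = ("http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/"
       "consultaNFCe?chNFe=41161176189406002412651190000337421101819214"
       "&nVersao=100&tpAmb=1&cDest=02236640900"
       "&dhEmi=323031362d31312d31305431383a31393a32312d30323a3030"
       "&vNF=38.72&vICMS=3.26"
       "&digVal=502b7663785154305335316d536b34726f70392f4e7561774578493d"
       "&cIdToken=000001"
       "&cHashQRCode=CE624A6959CF87362570112D42D9B1F49A82B8EE")

md5_digest = hashlib.md5(url.encode("utf-8")).hexdigest()    # 32 hex chars
sha1_digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex chars
```

Either way the digest is deterministic for a given URL; the trade-off is only digest length and collision resistance, where SHA1 is the stronger of the two.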
Beyond that, does it make sense? What if two different people submit the same URL? Would it make sense to reuse the results we got from the initial crawl for both of them?
One thing to consider is the immutability of those documents. Once a document has been generated it shouldn't change; and if it does change, what matters to us are the prices that were used.
Just questions, no answers 🤔