Open adilsoncarvalho opened 8 years ago
I read a few documents this week about the differences between MD5 and SHA1 concerning hash collisions. While MD5 is widely supported, SHA1 is considered stronger against collisions, so we're going to use SHA1 as our default hash whenever we need to generate a unique value.
So, from the above post we can now say this:
While activating the HTTP cache we learned that generating hashes can get tricky. Take a look at Understanding the hashes used as filenames.
It seems like a good idea to be able to quickly check whether a specific page has already been scraped.
In Paraná state we enter a summary page and then go to the real one, which has a random URL, so we can generate the unique identifier from the entry-point URL.
Entry point URL: http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/consultaNFCe?chNFe=41161176189406002412651190000337421101819214&nVersao=100&tpAmb=1&cDest=02236640900&dhEmi=323031362d31312d31305431383a31393a32312d30323a3030&vNF=38.72&vICMS=3.26&digVal=502b7663785154305335316d536b34726f70392f4e7561774578493d&cIdToken=000001&cHashQRCode=CE624A6959CF87362570112D42D9B1F49A82B8EE
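The "was this page already scraped?" check could then be a simple file lookup. This is only a sketch under an assumed cache layout (one file per page, named by the SHA1 of its entry-point URL, stored in a hypothetical `downloads` directory); the project hasn't settled on this yet:

```python
import hashlib
from pathlib import Path

def already_scraped(url: str, cache_dir: str = "downloads") -> bool:
    """Return True if a cached file exists for this entry-point URL.

    Assumes a hypothetical layout where each scraped page is saved
    under the SHA1 hex digest of the URL that led to it.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return (Path(cache_dir) / digest).is_file()
```

Because the entry-point URL is stable (unlike the random URL of the real page), hashing it gives a deterministic key for the lookup.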
I'm not sure which method is better:

MD5: 1d962fbf639ececb4cd41c998d7d7ed6
SHA1: 28d2e48ad54e28c7c378d6ce926dbb86c165aade
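For reference, both digests can be produced with Python's standard hashlib; this is just a sketch of how the two candidates above would be derived from the entry-point URL (I haven't verified it reproduces the exact hex values listed):

```python
import hashlib

# the Paraná entry-point URL from the issue
url = ("http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/"
       "consultaNFCe?chNFe=41161176189406002412651190000337421101819214"
       "&nVersao=100&tpAmb=1&cDest=02236640900"
       "&dhEmi=323031362d31312d31305431383a31393a32312d30323a3030"
       "&vNF=38.72&vICMS=3.26"
       "&digVal=502b7663785154305335316d536b34726f70392f4e7561774578493d"
       "&cIdToken=000001"
       "&cHashQRCode=CE624A6959CF87362570112D42D9B1F49A82B8EE")

md5_digest = hashlib.md5(url.encode("utf-8")).hexdigest()    # 32 hex chars
sha1_digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex chars
```

Either way the digest is deterministic for a given URL; the trade-off is only digest length and collision resistance, where SHA1 is the stronger of the two.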
Beyond that, does it make sense? What if two different people submit the same URL? Would it make sense to reuse the results we got from the initial crawl for both of them?
One thing to consider is the immutability of those documents. Once a document has been generated it shouldn't change; and if it does change, what matters to us are the prices that were used.
Just questions, no answers 🤔