adilsoncarvalho / barateza-nfcrawler

Crawler to get data from the NF-e and NFC-e

Create a unique identifier for each submitted NFe #24

Open adilsoncarvalho opened 8 years ago

adilsoncarvalho commented 8 years ago

It seems like a good idea to be able to quickly check whether a specific page has already been scraped.

In Paraná state we first land on a summary page and then move on to the real one, which has a random URL, so we can generate the unique identifier from the entry point URL.

Entry point url: http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/consultaNFCe?chNFe=41161176189406002412651190000337421101819214&nVersao=100&tpAmb=1&cDest=02236640900&dhEmi=323031362d31312d31305431383a31393a32312d30323a3030&vNF=38.72&vICMS=3.26&digVal=502b7663785154305335316d536b34726f70392f4e7561774578493d&cIdToken=000001&cHashQRCode=CE624A6959CF87362570112D42D9B1F49A82B8EE

I'm not sure about which method is better:

Beyond that, does it make sense? What if two different people submit the same URL? Would it make sense to reuse the results from the initial crawl for both of them?
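To make the reuse question concrete, here is a minimal sketch of one way it could work, not how the crawler actually does it. It assumes results are kept in an in-memory dict keyed by the entry point URL. The names `get_or_crawl`, `fake_crawl` and the `example.invalid` URL are hypothetical and only there for illustration.

```python
from typing import Callable, Dict

# Hypothetical in-memory store: entry point URL -> crawl result.
_results: Dict[str, dict] = {}


def get_or_crawl(entry_url: str, crawl: Callable[[str], dict]) -> dict:
    """Return the stored result for entry_url, crawling only on first sight."""
    if entry_url not in _results:
        _results[entry_url] = crawl(entry_url)
    return _results[entry_url]


# Stand-in for the real crawl so the sketch runs without network access.
calls = []


def fake_crawl(url: str) -> dict:
    calls.append(url)
    return {"url": url, "items": []}


# Two people submitting the same (made-up) URL trigger a single crawl.
first = get_or_crawl("http://example.invalid/consultaNFCe?chNFe=1", fake_crawl)
second = get_or_crawl("http://example.invalid/consultaNFCe?chNFe=1", fake_crawl)
```

Whether sharing one result between two submitters is acceptable is still the open question here; the sketch only shows that the lookup itself is cheap.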

One thing to consider is the immutability of those documents. Once a document has been generated it shouldn't change, and if it does change, what matters to us for now are the prices used.

Just questions, no answers 🤔

adilsoncarvalho commented 8 years ago

I read a few documents this week about the differences between MD5 and SHA1 regarding hash collisions. While MD5 does a decent job, SHA1 is considered stronger in this respect, so we're going to use SHA1 as our default hash whenever we need to generate unique values.
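As a rough illustration of that decision, the sketch below derives an identifier from an entry point URL with SHA1 using Python's standard `hashlib`. The function name `url_identifier` is hypothetical, and the example URL is a shortened form of the Paraná one quoted earlier (only the `chNFe` parameter is kept), so this is an assumption about how it might be wired up rather than the project's actual code.

```python
import hashlib


def url_identifier(entry_url: str) -> str:
    """Return the SHA1 hex digest (40 hex characters) of the entry point URL."""
    return hashlib.sha1(entry_url.encode("utf-8")).hexdigest()


# Shortened version of the entry point URL from the first comment;
# the remaining query parameters are left out for brevity.
example_url = (
    "http://www.dfeportal.fazenda.pr.gov.br/dfe-portal/rest/servico/"
    "consultaNFCe?chNFe=41161176189406002412651190000337421101819214"
)

identifier = url_identifier(example_url)
```

Note that the hash covers the exact string: the same document reached through a URL with its parameters in a different order, or with extra parameters, would produce a different identifier unless the URL is normalized first.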

So, from the above post we can now say this:

I'm not sure about which method is better:

adilsoncarvalho commented 8 years ago

While activating the HTTP cache we learned that generating hashes may get tricky.

Take a look at Understanding the hashes used as filenames