Universal-Rom-Tools / Universal-XML-Scraper

ROM scraper

Suggestion (didn't find where to send) Do the hash directly on the pi instead of downloading #80

Closed giventofly closed 7 years ago

giventofly commented 7 years ago

Why not do the hash of the file directly on the Pi? Since UXS already establishes an SSH connection, that would improve speed and make hashing the bigger files possible. Stephen Selph's scraper does the hash on the Pi - ok, it's run from there, but with SSH access it could be done remotely too.

Where is the proper place to make this suggestion?

Universal-Rom-Tools commented 7 years ago

You're in the right place to do it ;)

So, some explanation:

UXS doesn't download anything. It does the hash on the file directly on the Pi (if the files are there ;) ). Hashing a small file (less than a few MB) is really quick... near instant.

But for very big files (like ISOs) it's very long (the hashing time grows with the file size).

I'm not sure calculating the hash directly on the Pi would be faster (the Pi is less powerful than your PC, I hope ;) ), and hash calculation is only a CPU problem ;) The faster your CPU, the faster your hash.

The idea is to not calculate "all" the hashes (I hash CRC32, MD5, and SHA1): for files bigger than 50 MB, I hash only CRC32... Maybe I can try hashing just MD5 or just SHA1 to see which one is the fastest ;)

giventofly commented 7 years ago

But the problem is not the time spent doing the hashing, the problem is transferring the file from the Pi to my PC. The scraper crashes because of that. It's faster to do the hash on the Pi than to download ~700 MB and hash it.

Universal-Rom-Tools commented 7 years ago

Why do you copy them? Why don't you scrape directly on your Pi?

UXS downloads nothing from the Pi to your PC... It scrapes the file directly where it is... (So if it's on your Pi, it scrapes it on your Pi over the network... but doesn't download it.)

giventofly commented 7 years ago

Judging by the time it takes, I'm guessing it downloads the file and hashes it on my PC.

When I run sselph's scraper it's way faster at doing the hash (SNES, Mega Drive, etc.) than UXS. I'm pretty sure UXS downloads the files via Samba and hashes them on my PC. Is there a way I can test it?

Universal-Rom-Tools commented 7 years ago

I don't understand, really...

For sure UXS doesn't copy anything. On my computer, it takes less than 1 second to hash (CRC32, MD5, and SHA1) a small file (like SNES, Mega Drive, etc...).

Did you try 2.0.0.8 and look at the logs? (You can find them in the Help menu.)

I added the hash time to the log, right after every hash.

Can you send me your log (via Pastebin)?

Are you using WiFi? Maybe your connection is really slow?

giventofly commented 7 years ago

Downloaded the new version so I can check the hash time.

I'm running top and nload to check the CPU consumption on the Pi and to confirm whether the files are being downloaded (unless hashing alone somehow generates over 1 GB of traffic per file, it's downloading the files and doing the hash on my PC instead of on the Pi).

I wasn't doing anything else with the Pi (you can check that in top too).

Log for PSX: http://pastebin.com/u2LGrDtM (UXS stops responding for a while once it starts the download, but it keeps going). Image from top+nload: http://imgur.com/a/iKPtH

I also tried GBC; log below (the download is even clearer in this log, since it extracts the .zip on my local machine), and you can check that the transferred amount matches the total size of the ROMs (I rebooted the Pi to get fresh nload numbers).

http://pastebin.com/NtfpdRHB http://imgur.com/a/WMX14

As you can see, it's downloading the files instead of hashing them directly on the Pi.

Universal-Rom-Tools commented 7 years ago

Hmm... I think there is a misunderstanding...

This is the function I use to hash: https://www.autoitscript.com/forum/topic/95558-crc32-md4-md5-sha1-for-files/

As far as I can see, it creates a memory mapping to work on the file (and hash it). It doesn't really download the file, it "opens" it in memory (ok... from the network point of view it's the same :S ). (Like when you watch a movie from a network drive: it doesn't "download" the file, it opens it over the network, but by the time the movie is over, the "full" file has effectively crossed the network to your local computer.)
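If you want to see that for yourself, here is a rough check (just a sketch; the mount point and share name are examples, and it assumes a Linux box with the Pi's share mounted):

```bash
# Hash a ROM that lives on the Pi, over a mounted network share,
# while `nload` runs on the Pi in another terminal.
sudo mkdir -p /mnt/pi-share
sudo mount -t cifs //recalbox/share /mnt/pi-share -o guest   # share name is an example
time md5sum "/mnt/pi-share/roms/psx/Dead or Alive.PBP"
# nload on the Pi should report roughly the full file size transferred,
# even though no copy of the file ends up on the local disk.
```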

For zip files it's different: if the game isn't found by hash or by name, UXS unzips it... I need to put it somewhere X| so yes, in this case there is an unzipped file on the local machine...

The thing is, I really don't know how to do it another way.

Hashing directly from the Pi:
- needs a "hasher" on the Pi (I don't know how to dev that :S )
- needs a command sent to the Pi to ask for hashing a specified file (with plink, why not)
- needs to grab the result back to use it locally :S

All of that needs a lot of ACKs to be sure the hash has started and finished.

On top of that, the Pi isn't as powerful as your computer... So even if the Pi hashes the file, I'm not sure there would be much time difference :S
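(Just to sketch the plink idea: something like the line below already returns the hash on stdout, so there is nothing to copy back from a file on the Pi. The hostname, password and path are placeholders / the Recalbox defaults; adjust to your setup.)

```
plink -ssh -batch -pw recalboxroot root@recalbox "md5sum '/recalbox/share/roms/psx/Dead or Alive.PBP'"
```

The output is a single line, `<md5>  <path>`, which can be parsed directly on the local side.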

Again, it's only with "big files" that it's a problem (small files take less than 1 second to hash). So try setting the Scrape Search Mode option to "Filename" instead of "CRC+Filename": then no hash is calculated at all...

I changed the hash "order" in 2.0.0.8... So now:
- if the file is bigger than 50 MB -> no CRC32 is calculated (CRC32 is very slow)
- if the file is bigger than 500 MB -> no SHA1 either
- in all cases, MD5 (except if you choose Filename only in Scrape Search Mode)
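Just to make those thresholds concrete, a rough bash equivalent of that rule (purely illustrative; only the 50 MB / 500 MB limits come from the description above):

```bash
#!/bin/bash
# Sketch: pick which hashes to compute based on file size,
# mirroring the 50 MB / 500 MB thresholds described above.
file="$1"
size=$(stat -c%s "$file")              # file size in bytes

md5sum "$file"                         # MD5 is always computed
if [ "$size" -le $((500 * 1024 * 1024)) ]; then
    sha1sum "$file"                    # SHA1 only up to 500 MB
fi
if [ "$size" -le $((50 * 1024 * 1024)) ]; then
    crc32 "$file"                      # CRC32 only up to 50 MB (crc32 from libarchive-zip-perl)
fi
```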

giventofly commented 7 years ago

I will check this; it's easy for sure. I'll look at the functions you use and report back to you.

Yeah, I understand the difference: basically it "streams" the file over the network, but in the end it's still "downloading" it. It would be way faster to do it on the Pi, I'm pretty sure. The Pi has a better processor than my last netbook :) (the Pi 3 at least).

I will give you a report on how to do it over SSH and/or the options to give to the users.

BTW, thank you for the time you've spent on this "issue". If I'm able, I will translate the missing parts to Portuguese too, but I'm kind of short on time.

giventofly commented 7 years ago

Okay, CRC32 takes around 4 s per PBP file, so I would say it's okay :)

I guess all the tools are installed by default in Raspbian (they're default Linux tools, so I guess they are), but if not, there are two options to offer the user:

1) They do it themselves:

sudo apt-get update
sudo apt-get install x y z

crc32: sudo apt-get install libarchive-zip-perl
crc32 alternative: sudo apt-get install cksfv

2) You run the install commands via SSH through UXS (at this moment I don't know how to check the result, but I'll find out while I'm writing this report).

Okay, so how do we do the hashing and get the results? My first option is to hash all the files in a folder, put the results in a .txt, pull it back and then fetch the images; or do it one by one, writing to a text file each time and processing from there.

The other option: do the hashing and get the results straight back into memory, working from there (one file at a time). Again, at this moment I don't know how to do that, but I guess I will by the end of this; I'm kind of writing this as I go :)

CRC32

cksum FILE (pre-installed in all Linux distributions, on the Pi by default)
prompt: cksum Dead or Alive.PBP
result: 1753605944 359645292 Dead or Alive.PBP
took ~3 s. It's a plain CRC, not CRC32; I found that out later.

crc32 (needs to be installed)

crc32 Dead\ or\ Alive.PBP
70de4960
~ took around 6 s, expected output?

cksfv Dead\ or\ Alive.PBP
; Generated by cksfv v1.3.14 on 2016-12-14 at 20:35.16
; Project web site: http://www.iki.fi/shd/foss/cksfv/
;
; 359645292 18:33.08 2016-12-05 Dead or Alive.PBP
Dead or Alive.PBP 70DE4960
~ took around 6 s, expected output?

md5

md5sum Dead\ or\ Alive.PBP (pre-installed in all Linux distributions)
0dd734dd2c1eff5bf34b3301a3812a5d  Dead or Alive.PBP
~ 3 s, expected result?

sha1

sha1sum Dead\ or\ Alive.PBP
5637ba2bd3db22e83cb5649eb3813bd0f39dde9d  Dead or Alive.PBP
~ 6 s, expected result?

Okay, so the only tool that might be missing is the one for CRC32 (if it really has to be CRC32 and not plain CRC, but that's easy to get working too), and the Pi handles hashing the files pretty well.

Now the second part: getting the results back. Via a .txt file it's easy:

command FILE > results.txt
scp results.txt (I don't know how you're pulling files from the remote machine, but this goes through the SSH session)
extract the values from results.txt
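For example, a minimal sketch of the "all files to a .txt" variant (hostname, user and paths are only placeholders):

```bash
# Hash every ROM in a folder on the Pi in one go, then pull the result file back.
ssh pi@retropie 'cd /home/pi/RetroPie/roms/psx && md5sum *.PBP > /tmp/results.txt'
scp pi@retropie:/tmp/results.txt .
# results.txt now holds one "<md5>  <filename>" line per ROM.
```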

Or get the results "on the fly", from http://superuser.com/questions/130443/remotely-run-script-on-unix-get-output-locally:

[quote]
ssh remote_host "ls > /tmp/file_on_remote_host.txt"

For saving output locally,

ssh remote_host "ls" > .\file_on_local_host.txt

To combine stderr remotely and save it and stdout locally,

ssh remote_host "ls 2>&1" > .\combined_output_on_local_host.txt
[/quote]

Now, the zip files: just unzip file.zip

unzip 'Uno (U) [C][!].zip'
Archive:  Uno (U) [C][!].zip
  inflating: Uno (U) [C][!].gbc

and then:
operation file.gbc
rm file.gbc

Probably it's better to create a tempdir to extract to:
mkdir tempdir
sudo unzip file.zip -d ./tempdir
and after everything is done:
rm -rf tempdir

You can also unzip all the files in one go (unzip '*.zip' -d ./tempdir) and then process everything in a row (like I discussed earlier, sha1sum ... > results.txt), but one by one is probably safer for error checking.
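A minimal sketch of that per-file flow on the Pi (paths are examples, and the hash command could be any of the ones tested above):

```bash
#!/bin/bash
# Sketch: extract one zip to a temp dir, hash its contents, clean up.
zipfile="$1"
tempdir=$(mktemp -d)            # safer than a fixed ./tempdir

unzip -q "$zipfile" -d "$tempdir"
for rom in "$tempdir"/*; do
    md5sum "$rom"               # or sha1sum / crc32, as needed
done

rm -rf "$tempdir"               # remove the extracted copies
```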

In conclusion: only for CRC32 do you need to install extra stuff on the Pi, the rest is there by default and processes in a very reasonable time, way better than "downloading" the file over the network and doing the hashing locally.

It's safe to do everything on the Pi, and it will probably speed this process up a lot :)

I guess this would be easy to implement. BTW, I'm on a Pi 3; other Pi versions should take a little longer, but in the end it should still be a bit faster even on the slower models and WAY WAY faster on any Pi 3.

I'll wait for some feedback from you.

Universal-Rom-Tools commented 7 years ago

Wow... what a great job you've done ^^

Ok, so I just tested md5sum and sha1sum on Recalbox (I think there's no problem on RetroPie, you can install whatever you want on it ^^) and it works great...

I use plink for the SSH access... (like PuTTY, but command-line only) and I can catch the stdout directly and use it ;) so no need to play with files on the Pi.

I'll try to add this experimental function ;) I'll keep you posted.

Universal-Rom-Tools commented 7 years ago

I spent my day on it ^^ but it works ;) It's faster for big files but, strange thing, it's slower for small files :S (I'm on a Pi 2).

I just added this "experimental" function to 2.0.0.9. To use it, add this to UXS-Config.ini:
$vHashOnPI=1
$vRootPathOnPI=/recalbox/share/roms
($vRootPathOnPI is the local path on the Pi to the roms folders; you need to adapt it for RetroPie.)

I'll release it in a minute ;) can you test it?

giventofly commented 7 years ago

Yes. Do I add that now, or is the new version already available?

Universal-Rom-Tools commented 7 years ago

You need this version: https://github.com/Universal-Rom-Tools/Universal-XML-Scraper/releases/tag/2.0.0.9

Launch it at least once (to generate all the files), then edit UXS-Config.ini.

Add this to the end :
$vHashOnPI=1
$vRootPathOnPI=/recalbox/share/roms

Replace /recalbox/share/roms with the RetroPie path to the roms folders.
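For example, assuming a stock RetroPie install (adjust if your roms live somewhere else):

```
$vHashOnPI=1
$vRootPathOnPI=/home/pi/RetroPie/roms
```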

and try to scrape ;)

giventofly commented 7 years ago

Working, and indeed so much faster :)

Doing the scraping faster than GBC now :D

Can I give another suggestion? An option to select the game manually, or to input the URL from ScreenScraper when it doesn't find the game. For example, for Dreamcast I can see the game on the site, but even if I search by filename (I copied the name from the site) it still misses it.

Universal-Rom-Tools commented 7 years ago

Great, it works ^^

Please open a new issue for a new question ;)

(So I can close this one ^^)