OriHoch / hackathon-tasks

MIT License

Fetching images #1

Open namlool opened 6 years ago

namlool commented 6 years ago

Hi, I plan to pull images at random so I can work on them. I don't really want to deal with the API right now, since I mainly want to work on the logic. Is there any chance there's simply a folder I could pull images from, or alternatively a library that wraps all of this? I'm writing in C#. Yaniv

OriHoch commented 6 years ago

I'm not familiar with C# and Microsoft environments, but if you can find a C# library that works with a certain data format or server, I can write a pipeline that generates or exports the data into that format, or set up a server with this data.

OriHoch commented 6 years ago

also, if you could be more specific about what you intend to do, or point to an example of existing C# code that does what you want - that would be very helpful for understanding how I can help you

namlool commented 6 years ago

Basically I have my own face recognition engine (originally based on OpenCV, with improvements).

It can match people regardless of differences in age and emotion.

I want to use it to match the same person across the images.

Now… I want to work on the logic and on improving the recognition engine, and not waste time on the API. If, for example, I could use the images from a shared drive instead of going through the API, that would help me.

OriHoch commented 6 years ago

cool, I can do that, how about if I provide you a CSV or XLS file which contains urls for the images?

namlool commented 6 years ago

CSV will be great! Thanks!
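
For picking images at random from such a CSV, a minimal Python sketch could look like the following. The column name `url` is an assumption for illustration only; the real files may use different headers, so adjust `url_column` accordingly.

```python
import csv
import io
import random


def sample_image_urls(csv_text, k, url_column="url", seed=None):
    """Return up to k distinct image URLs picked at random from CSV text.

    `url_column` is a hypothetical header name, not one confirmed by the
    actual exported files.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Skip rows where the URL cell is missing or empty.
    urls = [row[url_column] for row in reader if row.get(url_column)]
    rng = random.Random(seed)
    # Never ask for more items than the file actually contains.
    return rng.sample(urls, min(k, len(urls)))
```

Passing a fixed `seed` makes the selection repeatable, which can help when comparing recognition results across runs on the same subset.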

sinairusinek commented 6 years ago

Hi Ori, and all! I have a similar request: our team, Streetext, would like to work on street ads/posters and could use a dump of the posters. Our goal is to develop search functionality and named entity recognition for the text. We already have a set of transcribed and analyzed texts for the posters listed in the attachments, but we do not have the images - and these we would need ASAP, in order to train an OCR model for them BEFORE the hackathon, which would enable working on a larger set of the posters. I would prefer a dump - I would not know what to do with a CSV file of the URLs - and I am not sure the coding team members will have the time to do it before the Hackathon... Thanks! Attachment: lists of the transcribed images-20171116T064021Z-001.zip

OriHoch commented 6 years ago

hi @sinairusinek I also hope to work as much as possible before the hackathon :+1:

hopefully this weekend I'll dive into the data a bit more and hopefully provide some exports and dumps

could you be more specific about how you would like to get the data dump? what kind of attributes you need for each item, in what format, etc..

I opened the zip file but don't understand what it contains.. :(

OriHoch commented 6 years ago

I assume the data dump will be a zip file with images? do you need them ordered in some way? how do you want the metadata, etc..

sinairusinek commented 6 years ago

Yep, a zip file will be great. Here is the complication: the numbers in the list are the system numbers (מספרי מערכת) of the images. Ideally, the numbers will be the filenames of the images, so that we can most easily align them with our text. Thanks, and sorry to hear about your weekend... enjoy it nonetheless!

OriHoch commented 6 years ago

cool, system numbers as filenames it is!

sinairusinek commented 6 years ago

btw, I am not sure this is helpful, but a manual search for each image would look like this:

http://merhav.nli.org.il/primo_library/libweb/action/search.do?&tab=default_tab&srt=rank&ct=search&mode=Basic&dum=true&indx=1&fn=search&vid=NNL_Ephemera&vl(freeText0)=700173153

OriHoch commented 6 years ago

@sinairusinek @namlool status update: pipelines are here. The main obstacle is downloading all the manifest files; once it's done (in a few hours, I hope) you will have your packages.

OriHoch commented 6 years ago

csvs with all the images are available here - https://github.com/OriHoch/nli-data-pipelines/tree/data/data/sequences

OriHoch commented 6 years ago

images are downloading.. it takes a long time because NLI generates them on-the-fly.. you can see progress here http://104.154.42.101/pipelines/#anchor-ALL-download-images

OriHoch commented 6 years ago

images are available in Google Storage by system number - https://storage.googleapis.com/nli-images/002366843.jpg (I'm still downloading..) also, in some cases there are multiple images per system number; these files look like this - https://storage.googleapis.com/nli-images/002366888_FL33248443.jpg and the first one is named with only the system number: https://storage.googleapis.com/nli-images/002366888.jpg
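
The naming scheme described above can be sketched in Python roughly as follows. The function names are made up for illustration, and the pattern is inferred only from the three example URLs in this comment, so it may not cover every file in the bucket. System numbers are kept as strings so that leading zeros survive.

```python
STORAGE_BASE = "https://storage.googleapis.com/nli-images"


def storage_url(system_number, file_id=None):
    """Build the image URL for a system number.

    With no file_id this is the "first" image; with a file_id such as
    "FL33248443" it is one of the additional images for the same item.
    """
    name = system_number if file_id is None else f"{system_number}_{file_id}"
    return f"{STORAGE_BASE}/{name}.jpg"


def system_number_from_url(url):
    """Recover the system number from a URL or bare filename."""
    filename = url.rsplit("/", 1)[-1]   # drop everything before the last slash
    stem = filename.rsplit(".", 1)[0]   # drop the ".jpg" extension
    return stem.split("_", 1)[0]        # drop an optional "_FL..." suffix
```

Going from filename back to system number is what allows the images to be aligned with a list of transcriptions keyed by system number.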

OriHoch commented 6 years ago

hmm, it's harder than I thought. The main problem, I guess, is that the API is very slow.. not sure how it will work at the actual Hackathon with even more people all working with it together, maybe they will scale up..

OriHoch commented 6 years ago

The entire poster collection, in a zip file: https://www.dropbox.com/s/rdihh0i69hg7hwl/ephemera.zip?dl=0 For now it is in low quality; full quality will follow later.

namlool commented 6 years ago

As far as I'm concerned, the Excel file you sent with the images and the links to them is good enough.

I'm working on images and face recognition.

Right now it simply takes a very long time to download each image; I hope this will improve.

NLI-API commented 6 years ago

> since I mainly want to work on the logic. Is there any chance there's simply a folder I could pull images from, or alternatively a library that wraps all of this?

There are plenty of those; have you tried searching? Since at the National Library we implemented the API according to IIIF, a widely adopted and well-known international standard, you can download a library that works with IIIF and just configure it with the National Library's API endpoints.

The whole idea is that you won't have to deal much with the API but rather with the logic, because we implement an international standard for which many libraries have already been written.
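
To illustrate what an IIIF client does under the hood, here is a minimal Python sketch that builds an IIIF Image API request URL from its standard parts (`{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). The base URL and identifier used with it are placeholders, not the National Library's real endpoint, and the defaults follow Image API 2.x (version 3.0 uses `max` rather than `full` for the size segment).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="full",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose an IIIF Image API URL.

    base_url and identifier are supplied by the caller; nothing here is
    specific to any one institution's service.
    """
    # Identifiers must be percent-encoded, including any "/" they contain.
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"
```

For example, `iiif_image_url("https://iiif.example.org/image", "some-id", size="!512,512")` asks the server for a version scaled to fit within 512x512 pixels, which is usually lighter on the service than requesting the full-size image.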

NLI-API commented 6 years ago

> cool, I can do that, how about if I provide you a CSV or XLS file which contains urls for the images?

These URLs do not point to image files on a server but to an API web service whose output is pixels. Therefore, you have to "talk" to the API to get an image.

NLI-API commented 6 years ago

> Yep, a zip file will be great. Here is the complication: the numbers in the list are the system numbers (מספרי מערכת) of the images. Ideally, the numbers will be the filenames of the images, so that we can most easily align them with our text. Thanks, and sorry to hear about your weekend... enjoy it nonetheless!

Sinai, that is easier than you think. Just use an IIIF library with the document identifiers as input, and you're done with the API access and can concentrate on the OCR. You don't need dumps.

NLI-API commented 6 years ago

We have added more computing power to the API this afternoon, we will scale up more if needed according to the demand. PLEASE DO NOT HARVEST the API. The National Library of Israel's API is intended for online use and not for harvesting.