dhlab-epfl / dhSegment

Generic framework for historical document processing
https://dhlab-epfl.github.com/dhSegment
GNU General Public License v3.0

Can't be installed under Windows #8

Closed Jim-Salmons closed 6 years ago

Jim-Salmons commented 6 years ago

dhSegment is AWESOME and EXACTLY what my wife and I need for our post-cancer #PayItForward Bonus Round activity doing grassroots #CitizenScience #digitalhumanities research in support of eResearch and machine-learning in the domain of digitization of serial publications, primarily modern commercial magazines. We are working on the development of the #MAGAZINEgts ground-truth storage format providing standards-based (#cidocCRM/FRBRoo/PRESSoo) integrated complex document structure and content depiction models.

When a tweet about dhSegment surfaced through my feed, I could barely contain myself... we have detailed, multi-valued metadata -- based on a metamodel of fine-grained use of PRESSoo's Issuing Rules -- that describe the location, bounding box, size, shape, number of colors, products featured, etc. for 7,157 advertisements appearing in the 48 issues of Softalk magazine (https://archive.org/details/softalkapple). It will be trivial for me to generate the annotated label images for all these ads, as we have already programmatically extracted the ad sub-images from the full pages after using our "Ad Ferret" to discover and curate the specification for every ad.

Once we have a dhSegment instance trained on the Softalk ads, there are over 1.5M pages just within the "collection of collections" of computer magazines at the Internet Archive, and many millions more pages of content in magazines of all types over considerable time periods of their serial publication. The #MAGAZINEgts format, together with brilliant technical achievements like dhSegment, can open new levels of scholarship and machine access to digital collections. We believe dhSegment will be a valuable component for our research platform/framework.

With great excitement I chased down, installed, and tested the prerequisite CUDA and cuDNN frameworks/platforms under Windows. I now have both working at version 9.1. (This alone was tricky, but I got it working.)

Unfortunately, the current implementation of the incredibly important dhSegment environment cannot be installed under Windows 10. After the stock Anaconda environment yml file died somewhat dramatically, I then took that file and attempted to search for and install each package individually. (NOTE: I am not a Python expert, so what I report here is subject to refinement by someone who knows better...) Here is what is NOT available under Windows:

# Python packages for dh_segment not available under Windows
- dbus=1.12.2
- fontconfig
- glib=2.53.6
- gmp=6.1.2
- graphite2=1.3.10
- gst-plugins-base
- gstreamer=1.12.4
- harfbuzz=1.7.4
- jasper=1.900.1
- libedit=3.1
- libffi=3.2.1
- libgcc-ng=7.2.0
- libgfortran-ng=7.2.0
- libopus=1.2.1
- libstdcxx-ng=7.2.0
- libvpx=1.6.1
- ncurses=6.0
- ptyprocess=0.5.2
- readline=7.0
- pip:
  - tensorflow-gpu==1.4.1 (I did find and install 1.8.0 instead)

Anything not on this list made it into my Windows-based Anaconda environment, the yml for which I have included here as a file attachment.

win10_dh_segment.yml.txt

I am so disappointed to not be able to install and use dhSegment under Windows. While a docker image would likely be possible to create, I am skeptical that it would work at the level needed for interfacing with the NVIDIA hardware and its CUDA/cuDNN frameworks, etc. Alternatively, perhaps a cloud-based dev platform would work for us (that is affordable as we are independent and unfunded #CitizenScientists). Your workaround/alternative suggestions are welcome.

At any rate, sorry for the overly long initial issue posting. But I wanted to explain my and my wife's great interest in this important technology as well as provide what I hope is useful feedback with regard to its potential use under Windows. Looking forward, I am very interested in evolving a collaborative relationship with you good folks of DHLAB.

ITMT, I am going to generate the labeled training images. :-)

Happy-Healthy Vibes, FactMiner Jim

P.S. Here is our #DATeCH2017 poster that will further explain the focus of our research. salmonsbabitsky_factminerssoftalk_poster

P.P.S. And here is a screenshot showing a typical metadata "spec" for an ad. The simple integer value for the AdLocation is used in concert with an embedded DSL in the fine-grained Issuing Rules of the Advertising Model. This DSL provides a resolution-independent means to describe and compute the upper-left corner and bounding box of an ad. For example, the four locations of a 1/4-page-sized ad on a page with a 2-column page grid are numbered 1-4, left-to-right, top-to-bottom. The proportions of these page segments are based on simple geometric proportional computations. magazinegts_documentstructure_advertisements
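To make that concrete, here is a minimal sketch of the idea in plain Python -- illustration only, not the actual #MAGAZINEgts DSL; the function name and pixel page size are made up:

# Illustration only -- not the actual #MAGAZINEgts DSL.
# Compute the resolution-independent bounding box of a 1/4-page ad from its
# AdLocation index (1-4, left-to-right, top-to-bottom) on a 2-column page grid.
def quarter_page_bbox(ad_location, page_width, page_height):
    cols, rows = 2, 2                   # a 2-column grid gives a 2x2 layout of 1/4-page slots
    col = (ad_location - 1) % cols      # 0 = left column, 1 = right column
    row = (ad_location - 1) // cols     # 0 = top half, 1 = bottom half
    cell_w, cell_h = page_width / cols, page_height / rows
    left, top = col * cell_w, row * cell_h
    return (int(left), int(top), int(left + cell_w), int(top + cell_h))

# e.g. slot 3 (bottom-left) on a hypothetical 2400 x 3300 px scan
print(quarter_page_bbox(3, 2400, 3300))  # -> (0, 1650, 1200, 3300)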

And finally, the evolving #MAGAZINEgts for the Softalk magazine collection at the Internet Archive is available here: https://archive.org/download/softalkapple/softalkapple_publication.xml

Jim-Salmons commented 6 years ago

Hi again, folks... Given that I cannot run a local instance of dhSegment under Windows, I would like to have a chat with someone from your team to make sure I generate the set of labeled training images -- the ones meant to teach a dhSegment about commercial magazine advertisements -- the right way.

Along these lines, it appears that the standard run configuration assumes that both the target image and labeled training image directories are to be found as siblings in a local directory. We routinely access high-resolution scan images of our digital collection directly from the remote Internet Archive servers. Might it be possible to provide user-configurable image directories to accommodate such use cases? These, btw, are use cases of particular interest to grassroots #CitizenScience/History projects.

Also, I mentioned that our ground-truth storage model uses a metamodel subgraph design pattern that incorporates a DSL (domain-specific language) supporting resolution-independent semantics to describe the upper-left corner and bounding box of our advertising dataset. Here is a screenshot of an AdPosition element within the PageGrid of one of sixteen AdSpec elements that describe the Issuing Rules-based design of Softalk magazine.

magazinegts_metamodel_issuing_rules_admodel

In this case, we're describing the four possible positions of a 1/3-page-sized ad, vertically oriented on a 2-column page. While the semantics of our embedded PageGrid DSL were derived from our Ad Ferret prototype's Python source code, I wrote a parser for these semantics to avoid directly executing content read from an XML-based #MAGAZINEgts publication file.

In addition, the UI/UX of the current generation "Ad Ferret" tool configures its AdSpec widgets and all their logically-consistent state changes based on the Issuing Rules of the Advertising Model in the Metamodel partition of the #MAGAZINEgts file.

I look forward to your best advice on the preparation of the 7,157 labeled training images I need to generate in hope of an opportunity to train a dhSegment on our #MAGAZINEgts model and associated reference image dataset. :-)

Happy-Healthy Vibes from Colorado, USA, Jim and Timlynn, too

solivr commented 6 years ago

Hi Jim and Timlynn, thank you for your interest in our tool and your feedback!

We don't have a Windows machine available at the moment, but I will try the installation on a Windows laptop later this week and get back to you quickly (however, my machine has no GPU, so I won't be able to test the tensorflow-gpu package). In the meantime, could you please check that you installed tensorflow-gpu properly by validating your installation as indicated here? I think that you'll need to downgrade to CUDA 9.0 to use tensorflow 1.8.
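For example, a quick sanity check along these lines with the TensorFlow 1.x API should list your GPU:

# Quick sanity check (TensorFlow 1.x API, matching the 1.4-1.8 versions discussed here);
# the device list should include something like '/device:GPU:0'.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print([d.name for d in device_lib.list_local_devices()])

with tf.Session() as sess:
    # runs on the GPU if CUDA/cuDNN are set up correctly
    print(sess.run(tf.constant([1.0, 2.0]) + tf.constant([3.0, 4.0])))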

Concerning the image / labels directories location for the training, we could adapt the code to allow the user to input a csv file with the following schema: PATH_TO_IMAGE;PATH_TO_LABEL_IMAGE. You would then need to generate such a csv file with your data. Would this be a solution for you?
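As an illustration only (folder and file names are just placeholders), generating such a file could look like this:

# Sketch: write a PATH_TO_IMAGE;PATH_TO_LABEL_IMAGE csv by pairing files that
# share a filename; 'images', 'labels' and 'train_pairs.csv' are placeholders.
import csv
import os

images_dir, labels_dir = 'images', 'labels'

with open('train_pairs.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=';')
    for name in sorted(os.listdir(images_dir)):
        label_path = os.path.join(labels_dir, name)
        if os.path.exists(label_path):
            writer.writerow([os.path.join(images_dir, name), label_path])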

Also, regarding the XML format, we use some of the objects defined by PAGE XML, such as TextRegion, GraphicRegion, ..., to convert annotations in XML format to label images and vice versa. You can have a look at the methods in PAGE.py and adapt them to the needs of your XML schema.
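As a rough illustration of the PAGE-to-label-image direction (this is not the actual PAGE.py code, and the namespace version may differ for your files):

# Rough illustration only -- not the actual PAGE.py code. Rasterize TextRegion
# polygons from a PAGE XML file into a label image with lxml and Pillow,
# assuming the usual "x1,y1 x2,y2 ..." Coords/@points encoding.
from lxml import etree
from PIL import Image, ImageDraw

NS = {'p': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15'}  # adjust to your PAGE version

def page_xml_to_label(xml_path, width, height, color=(255, 0, 0)):
    tree = etree.parse(xml_path)
    label = Image.new('RGB', (width, height), (0, 0, 0))
    draw = ImageDraw.Draw(label)
    for region in tree.findall('.//p:TextRegion', NS):
        points = region.find('p:Coords', NS).get('points')
        polygon = [tuple(int(v) for v in pt.split(',')) for pt in points.split()]
        draw.polygon(polygon, fill=color)
    return label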

Best, Sofia

jkloe commented 6 years ago

I was able to install it on a Windows machine with tensorflow-gpu. Sofia is right, you have to carefully check the compatible versions of tensorflow <-> CUDA <-> python.

SeguinBe commented 6 years ago

Great news @jkloe, good to know things work well on Windows.

I think we have to make a better conda environment file; it should be possible to have a cross-platform environment.yml if we only specify the higher-level packages (tensorflow, scipy, numpy, opencv, etc.) without all the dependencies.
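Something along these lines is what I have in mind (an untested sketch, version pins indicative only):

# Untested sketch of a cross-platform environment.yml; version pins are indicative only
name: dh_segment
dependencies:
  - python=3.6
  - numpy
  - scipy
  - opencv
  - pip
  - pip:
    - tensorflow-gpu==1.8.0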

Jim-Salmons commented 6 years ago

Thank you wonderful people for your quick and supportive reply. With the timezone difference I am just getting up and will test and tweak, if needed, my current conda environment to further explore and hopefully validate a successful installation under Windows.

The alternate, optional method of supporting a CSV file with image paths -- provided that either of these paths can be a remote URL -- would be most helpful. +10 :-) The most likely use case is a remote image URL, e.g. a high-resolution "leaf" image from an Internet Archive collection, with the labeled training path pointing to a directory on the local machine. Such remote access to public image collections, like the Internet Archive and the Europeana Newspapers collection, will be most helpful for dhSegment's support of grassroots #CitizenScience and #CitizenHistory projects.
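For illustration, the kind of thing I have in mind (the URL and paths below are made up):

# Made-up example: cache a remote Internet Archive leaf image locally, then pair
# it with a locally generated label image in the csv. The URL pattern is illustrative only.
import csv
import os
import requests

def cache_leaf(url, cache_dir='ia_cache'):
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, url.rsplit('/', 1)[-1])
    if not os.path.exists(local_path):
        r = requests.get(url, timeout=60)
        r.raise_for_status()
        with open(local_path, 'wb') as f:
            f.write(r.content)
    return local_path

leaf_url = 'https://archive.org/download/softalkapple/page_0042.jpg'  # illustrative URL only
with open('train_pairs.csv', 'a', newline='') as f:
    csv.writer(f, delimiter=';').writerow([cache_leaf(leaf_url), 'labels/page_0042.png'])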

Will report on my progress refining the Windows install of dhSegment. Thank you again to the entire extended team/projects involved in the development of dhSegment. This is very important work you are doing. dhSegment is the most exciting OLR/OCR tech project I have seen since the #DATeCH2017 LAREX tech/papers by Christian Ruel and folks at U. Wurzburg (http://www.is.informatik.uni-wuerzburg.de/index.php?id=181460).

As our FactMiners' project has excellent mentor/collaborator relations with PRImA researchers, it is also exciting to see that, like LAREX, dhSegment supports PRImA's PAGEgts format. PAGEgts is part of the #MAGAZINEgts ontological "stack" in support of an integrated document structure and content depiction model for, in our case, magazines as a special case of PRESSoo's serial publication documentation standard. So it is encouraging to see PRImA's ground-truth format used by such excellent and important Europeana development projects.

Jim-Salmons commented 6 years ago

Okay, here is a quick late-afternoon experience report update from "across the pond"... I am in the process of installing CUDA 9.0 and its associated cuDNN edition in parallel to support dhSegment. My investigation suggests that there was no need to uninstall 9.1, as it is installed in a separate subdirectory from 9.0. So I just need to be aware of precedence relationships in the PATH searches that might impact dhSegment.
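A throwaway check along these lines can confirm which toolkit a process will pick up first (just a sketch, nothing from dhSegment itself):

# Throwaway check: list CUDA entries on PATH in search order to see whether
# the 9.0 or 9.1 toolkit will be found first.
import os

for i, entry in enumerate(os.environ['PATH'].split(os.pathsep)):
    if 'CUDA' in entry.upper():
        print(i, entry)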

I have not been able to forge ahead with excited, reckless abandon because -- while we have huge "real estate" in terms of available Microsoft OneDrive cloud storage -- our aging local machines do not have sufficient transient storage to move data from the cloud onto local off-line drives. IOW, it was easy to generate and incrementally upload huge amounts of in-process, not-yet-published image datasets; it is just tricky to free up local storage by copying this cloud-stored data onto local hard drives in order to make room for the development platform requirements. Nothing that can't be solved... just a gotcha in terms of time-to-reply. Cloud-based JIT storage and retrieval works GREAT if you just want small "gulps" of what is in the cloud; it is another story if you want to move around big chunks of that cloud-based data. :-/

As I write this update, I have been able to transfer a huge amount of stuff from cloud storage to local off-line storage, so I should be able to "go dark" ATM and push forward, taking your advice into consideration, in checking out whether dhSegment can be installed in a GPU-enabled Windows configuration. :-) More as it unfolds...

Also, in pursuit of the available alternatives to work with you good folks, I did some checking into Google's TensorFlow Research Cloud offering/service (https://www.tensorflow.org/tfrc/). I honestly believe that this currently free service might be the IDEAL platform/venue for you dhSegment folks to work with globally diverse and highly motivated #CitizenScience/#CitizenHistory projects, like FactMiners and The Softalk Apple Project. I would love to explore this potential with your folks further. ITMT, with the additional local storage space, I am going to dive back into finishing the installation and validation of my 9.0 configuration of CUDA and cuDNN. More as it unfolds...

Happy-Healthy Vibes, Jim & Timlynn FactMiners and The Softalk Apple Project

Jim-Salmons commented 6 years ago

TaDAAA!!!! :-) :-) We had an extra special "sundown chill" moment this evening as I just finished installing and validating our dhSegment instance under Windows 10 on a consumer/home level development box when our daily reflective moment arrived!

Seriously, this gives me goosebumps. Despite NVIDIA's multitude of "do you want to tell them or should we tell them" non-critical warning messages about the severe memory constraints that this software was being asked to perform under and that would dramatically impact performance, I can confirm that we are up and running at the level of exercising the pre-configured demo program. :-)

We have not yet attempted our own training. But the fact that we are good to go at a proof-of-capability level is INCREDIBLY exciting.

I did go ahead and submit an application for FactMiners (AKA me and Timlynn) to participate in the upcoming TensorFlow Research Cloud (beta) program. Have you good folks at DHLAB already applied to this OpenScience collaborative opportunity? If so, or if not, are you interested? We would love to work with you on our research agenda via mutual access to this incredibly "juicy" research resource. We do have a number of fellow researchers in the EuropeanaTech community who would likely be interested in such a collaborative co-development/exploration opportunity. This might be something we could talk about sometime soon.

These future potentials aside, here is a screenshot of my progress running the dhSegment demo program on an aging local non-tricked-out dev box... and yes, downgrading to CUDA 9.0 was needed but did not require uninstalling 9.1.

dhsegment_windows_factminers

SeguinBe commented 6 years ago

Well it is great news you managed to make it work.

About the TensorFlow Research Cloud, we do not feel it is really aimed at us, as the computational requirements of this project are fairly low (by design). We only need a single GPU for a couple of hours most of the time, so Cloud TPUs are definitely overkill for our situation :-)

I have given some thought to making an affordable cloud-based version that would allow people to process their data efficiently even if they do not have powerful machines, but it's more of a daydream of mine to pursue after I finish my thesis.

I'll close the issue, as it seems installation on Windows is not an actual problem, though we'll improve the requirements file.

Jim-Salmons commented 6 years ago

Hi Benoit! Thanks for the follow-up. Without any actual experience yet running dhSegment against our dataset, and not having tried to train the demo, I have no real appreciation for what can be done with relatively minimal hardware. ATM, I only have 2GB on an aged GTX 960, but it works. It looks like an 8 GB GTX 1070 is around $400 USD ATM, and a 1080 is about $550 USD. If this all looks promising, I would certainly twist my own arm to consider a timely upgrade to my desktop dev box. But I'll hold out until that is necessary.

That is encouraging that you folks are able to do "real work" (but slowly) on modest hardware. I'll hope that our requirements will allow the same.

I was thinking about the TF Research Cloud not so much as necessary access to a significant hardware platform, but rather as a good way to encourage and support adoption and diverse use of dhSegment by what I hope will be an increasingly large and active community of researchers -- especially folks like us: un/under-funded, independent #CitizenScientists. And with my "marketing hat" on, I was thinking that having some visibility supporting interesting DH projects around the world might get Google's attention, such that they would promote and support your project as a good example of their "selfless" motive for creating the TFRC program in the first place... and that might have some impact on sources of, and interest in, future funding to support DHLAB researchers and projects. :-)

At any rate, while I have not tried to prune my currently functional Windows-based dhSegment conda environment of unnecessary packages, I did clone it, perform as many conda and pip package upgrades as possible, and then test that dhSegment still ran. The resulting env yml for my current environment is here: win_environoment.yml.txt If you generate a new multi-platform env yml, don't hesitate to ping me to test it here if that will help.

My next steps are to take a closer look at your code and the various demo folders of images, models, etc. in preparation for generating the thousands of labeled training images we'll need to teach a dhSegment about magazine advertisements. :-)

Thanks again for your help and interest.

Happy-Healthy Vibes, Jim and Timlynn