AiDAPT-A / VisArchPy

pipelines for the extraction and processing of visuals from PDFs
https://visarchpy.readthedocs.io
MIT License

Beta testing #40

Closed manuGil closed 1 year ago

manuGil commented 1 year ago

Use cases:

Liviavanvliet commented 1 year ago

I found a couple of issues when configuring the settings:

  1. First, there was an error finding the .xml file, despite the path being correct: OSError: Error reading file 'data-pipelines/data/design-data100/00001_mods.xml': failed to load external entity "data-pipelines/data/design-data100/00001_mods.xml"

    • So it tries to load 00001 instead of starting at 00000
    • Changing .xml and .pdf id to 00001 fixed this
  2. Image settings don't seem to do anything; they don't change the size of the output image

  3. When iterating over more than 1 .xml it only outputs images from the 2nd id (e.g. 00002) and not the first

manuGil commented 1 year ago

Thank you @Liviavanvliet I will look into this.

manuGil commented 1 year ago

@Liviavanvliet

  1. That is expected
  2. The image settings define a filter that excludes images whose sizes are smaller than the values provided (in pixels). Output images for the layout analysis are extracted at their original resolution. I included an explanation in the new version of the README.md: https://github.com/AiDAPT-A/OpenDesign-Handbook/tree/pipeline2
  3. Can you share the code you used to iterate over the IDs, please?
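To illustrate item 2, here is a minimal sketch of a size filter of the kind described. It is not VisArchPy's actual code: `ImageInfo`, `filter_small_images`, `min_width` and `min_height` are hypothetical names, and it assumes an image is dropped when either dimension falls below its threshold.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    """Hypothetical record for an extracted image (not a VisArchPy type)."""
    name: str
    width: int   # pixels
    height: int  # pixels


def filter_small_images(images, min_width, min_height):
    """Keep only images at least min_width x min_height pixels.

    Assumption: an image is excluded if EITHER dimension is too small.
    The kept images are returned unchanged; nothing is resized.
    """
    return [
        img for img in images
        if img.width >= min_width and img.height >= min_height
    ]


if __name__ == "__main__":
    extracted = [
        ImageInfo("icon.png", 32, 32),
        ImageInfo("floorplan.png", 1200, 900),
    ]
    for img in filter_small_images(extracted, min_width=100, min_height=100):
        print(img.name, img.width, img.height)
```

The point of the sketch is that the thresholds only decide which images are kept, so changing them would not be expected to change the size of any output image.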

Liviavanvliet commented 1 year ago

@manuGil

  1. Oh, I didn't see that, thanks!
  2. I realised later that it was my own error when modifying the values on lines 212-213, so I tried different code to iterate over the files, which eventually worked:
import os

if __name__ == "__main__":
    input_dir = "MY_INPUT_DIR"
    # List files in the input directory (sorted so IDs are processed in order)
    file_list = sorted(os.listdir(input_dir))

    # Run the pipeline once per MODS file
    for filename in file_list:
        if filename.endswith("_mods.xml"):
            # Strip the "_mods.xml" suffix, leaving only the entry id
            entry_id = filename[: -len("_mods.xml")]
            main(entry_id)  # main() is assumed to be defined earlier in the script
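
A possible variant of this loop, as a sketch only: it collects the entry IDs up front and separates those that lack a matching PDF, which might help catch ID mismatches like the one in item 1 of the first comment before the pipeline runs. The helper name `collect_entry_ids` is hypothetical, and the `<id>.pdf` naming is an assumption based on the `00001` example above.

```python
from pathlib import Path

MODS_SUFFIX = "_mods.xml"  # naming pattern taken from the thread


def collect_entry_ids(input_dir):
    """Return (ready, missing_pdf): sorted IDs with and without a matching PDF."""
    input_path = Path(input_dir)
    ready, missing_pdf = [], []
    for mods_file in sorted(input_path.glob("*" + MODS_SUFFIX)):
        entry_id = mods_file.name[: -len(MODS_SUFFIX)]
        # Assumption: the PDF for an entry is named "<id>.pdf" in the same folder.
        if (input_path / f"{entry_id}.pdf").is_file():
            ready.append(entry_id)
        else:
            missing_pdf.append(entry_id)
    return ready, missing_pdf
```

With this, the loop would become `for entry_id in ready: main(entry_id)`, and any IDs in `missing_pdf` could be printed as a warning instead of failing partway through.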