Signbank / Global-signbank

An online sign dictionary and sign database management system for research purposes. Originally developed by Steve Cassidy. This repo is a fork for the Dutch version, previously called 'NGT-Signbank'.
http://signbank.cls.ru.nl
BSD 3-Clause "New" or "Revised" License

Dataset Media Manager View #1172

Closed susanodd closed 2 months ago

susanodd commented 7 months ago

Includes JSON-compatible methods for uploading videos via the API interface.

The new view allows a dataset manager to:

susanodd commented 7 months ago

Yes, the above is the goal. This is still a work in progress. The JSON URLs will be available once the rest is bug-free.

susanodd commented 7 months ago

Here are before and after screenshots from uploading a zip file, with the unzipped files stored in the import_videos folder.

dataset-media-manager-1

dataset-media-manager-2

susanodd commented 7 months ago

A command already exists for putting those videos into the glosses. That is going to be integrated into this. Then the respective commands will also be offered in the API.

susanodd commented 7 months ago

I forgot the hash in this commit:

https://github.com/Signbank/Global-signbank/pull/1173/commits/4d31ae5fa0aef53c95abb68f2ee6a3315b1d9644

susanodd commented 7 months ago

The interface looks like this now, with the pathnames of the uploaded, unzipped video files. They are stored in the location for importing videos, and they follow the subfolder structure required in the zip file the manager uploads. (The API functions mimic this; see #1152.)

dataset_media_view_uploaded_files

susanodd commented 7 months ago

I'm wondering to what extent it is useful to provide functionality here?

I added a hover-over on the individual file names.

The existing functionality for importing videos (going to be revised and moved here) does a batch import over the files shown here.

@uklomp @rem0g would you like to be able to do things at the individual file level?

Like some check boxes at the side?

And what about the api? Is a batch command for importing them sufficient?

This will appear on the signbank-dev when it is ready, then it's possible to try out different things without messing up the real signbank.

susanodd commented 7 months ago

Btw, it works for uploading a zip created on Apple and on Ubuntu. We'll have to try the Windows ones.

rem0g commented 7 months ago

Thank you, the media manager seems great.

The API is our requirement for importing the videos from our platform automatically. If it can be done via the body of a POST request, that would be great. A JSON response is required to handle callbacks.

We will use the api more than the interface.

susanodd commented 7 months ago

Okay, that's clear. Most of the templates use a txt response. (They all started out in Django 1.11.) I've been using JsonResponse in the more recent code and gradually updating things. JsonResponse is a Django class. Do you need additional things in it? Our templates that use it fetch the fields out of the JSON. I added an "errors" field to the upload-zip-archive one.

If you get a failed result, that is Django doing that (like using an alphabetic value in a numeric field). Django also handles the CSRF tokens (we don't generate those) and inspects request.user, but we add additional permission checks. We can't make some of the things "open", so a login is required.

I guess if I modify the DatasetMedia template to use the same JSON methods, then you'd be able to use the same. Luckily, if the template interface is designed well, Django gives you some things for free when updating the visible data.
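To illustrate the shape of such a response: a minimal sketch of the dict a zip-upload view could return before wrapping it in Django's JsonResponse. The field names other than "errors" are illustrative, not the actual Signbank API fields.

```python
import json

def zip_upload_payload(unzipped, ignored, error_message=""):
    """Build the response dict for a zip upload; in the Django view this
    dict would be wrapped in JsonResponse.  Only "errors" is a field
    mentioned in this thread; the other names are hypothetical."""
    return {
        "unzipped_files": unzipped,   # files staged in import_videos
        "ignored_files": ignored,     # entries skipped (e.g. non-mp4)
        "errors": error_message,      # empty string when all went well
    }

payload = zip_upload_payload(["NGT/nld/GEMEEN.mp4"], ["readme.txt"])
print(json.dumps(payload))
```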

I'll find out about adding your server to the trusted IPs. I know I can just modify stuff in the settings (they are different on the actual server than the ones in GitHub), but we only manage the "containers", and making the container accessible isn't done by us. I thought @Woseseltops could do that, but he can't. So I'm not sure if they would just take down the container if it started doing extra stuff. I know the containers can't modify anything outside of the container.

susanodd commented 7 months ago

Oh yes, if you get some weird errors that the URL isn't found, it could be that the URL pattern doesn't specify that a segment is a "digit" pattern. I only encountered this yesterday when testing the URLs. They worked on Ubuntu at the university, but not in PyCharm running the local server in iCloud; Django complained the URL didn't exist. I'll check the various URLs to make sure the digit pattern is used.
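The digit-pattern point can be shown with plain regex matching; the route name below is hypothetical, but the `\d+` group is the same mechanism a Django `re_path` would use to make the URL resolve only for numeric ids.

```python
import re

# Hypothetical route, equivalent to a Django re_path like
#   re_path(r'^dataset_media/(?P<datasetid>\d+)$', ...)
# The \d+ is what makes the URL resolve only for numeric ids.
pattern = re.compile(r'^dataset_media/(?P<datasetid>\d+)$')

print(pattern.match('dataset_media/5'))    # digits: a Match object
print(pattern.match('dataset_media/abc'))  # non-digits: None (Django 404s)
```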

susanodd commented 7 months ago

@rem0g do you have any need to "keep" the zip files archived on signbank? (Like maintain an archive of uploaded zip files so you can see them stored there.)

susanodd commented 7 months ago

Here's a view showing the zip files and their contents, for the case where zip archives have already been uploaded.

dataset-media-zip-archives

rem0g commented 7 months ago

@rem0g do you have any need to "keep" the zip files archived on signbank? (Like maintain an archive of uploaded zip files so you can see them stored there.)

No, not required.

susanodd commented 7 months ago

I added some visual information about the zip file contents and the uploaded files (that are staged for processing). The first screenshot is when the import_videos/tstMH folder is empty and no zip files exist in TEMP.

manage_media_tstMH

manage_media_status_jmported

susanodd commented 7 months ago

The same kind of information will be added to the data returned by the JSON upload-zip-archive call, probably as a dictionary showing the uploaded files and the ignored files. I'm not yet sure of the best way to convey the "success" of uploading an archive. Above, files that are not mp4 are ignored from the zip contents. Then, on inspecting the files in the import_videos folder, it shows whether or not a matching gloss was found. The import is several steps that depend on each other.

On my local computer, there is lots of "bogus" or "wrong" stuff that has been put in the import_videos folder during debugging, also in the past. The timestamp of the file is probably also useful, to let the dataset manager figure out what to remove. (Or a command to clear the folder.)
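A minimal sketch of such a listing: mp4 files staged under import_videos, each with a timestamp, so stale debris from earlier debugging runs stands out. The helper name is hypothetical; the real view renders this in a template.

```python
import time
from pathlib import Path

def staged_videos(import_dir):
    """List mp4 files staged under import_videos with a modification
    timestamp, ignoring non-mp4 debris.  Illustrative helper only."""
    rows = []
    for path in sorted(Path(import_dir).rglob('*.mp4')):
        if path.is_file():
            stamp = time.strftime('%Y-%m-%d %H:%M',
                                  time.localtime(path.stat().st_mtime))
            rows.append((str(path.relative_to(import_dir)), stamp))
    return rows
```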

susanodd commented 7 months ago

So far, I think it needs to be several steps.

  1. Uploading and unzipping the videos to the import_videos folder of the dataset.
  2. Importing the properly formatted videos to the glosses, assuming they exist.

For both parts, it kind of needs to be atomic. The second can be atomic per gloss, assuming the files that have not been processed yet remain there. For an API interface, this is a bit trickier, since the API should not execute extremely long transactions on the database. Thus the second part can be some kind of background loop: the API initiates an import of the staged videos in import_videos, and then somehow gets notified when it has completed.
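The per-gloss atomicity of the second step can be sketched as a loop in which each file is processed independently, so one failure leaves that file staged for retry instead of aborting the batch. In Django, each `import_one` call would run inside `transaction.atomic()`; the names here are illustrative.

```python
def import_staged(files, import_one):
    """Run the import step file by file.  Each call is independent, so a
    failure leaves that file staged instead of aborting the batch.  In
    Django, import_one would run inside transaction.atomic() to make each
    per-gloss import all-or-nothing.  Hypothetical helper names."""
    status = {}
    for f in files:
        try:
            import_one(f)                    # move video, update GlossVideo
            status[f] = 'imported'
        except Exception as exc:
            status[f] = 'staged: %s' % exc   # stays in import_videos
    return status
```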

susanodd commented 7 months ago

Here's another screenshot, of NGT. Here you can see that none of the files in the zip actually have corresponding glosses. Plus there are a bunch of "old" files in the import location that weren't inside the zip.

media-import-ngt-showing-status

susanodd commented 7 months ago

I'll add the timestamp to the information. Otherwise it's too confusing if there is old stuff.

rem0g commented 7 months ago

For both parts, it kind of needs to be atomic. For the second, it can be atomic per gloss, assuming the files that have not been processed yet remain there. For an api interface, this is a bit trickier since the api should not execute extremely long transactions on the database. Thus for the second, it can be some kind of loop in the background, that the api initiates an import on the import_videos staged videos. Then somehow get notice when it has been completed.

We can only handle one video per transaction via the JSON API, so it's best to have a limit on videos per API request at your side.

A callback on that same transaction is important, so it's best to work with a for loop in one large request and then, for every item, send a callback to the user interface and the API via JSON. Otherwise the browser and the API on the other side will assume the upload failed after x seconds of timeout.

susanodd commented 7 months ago

@rem0g what about using jsonlines, so that a JSON file containing all the uploaded videos per gloss can be streamed?

This would work similarly to how CSV files are generated. (We already do that there, because generating thousands of lines also takes time.)

Then you'd get a JSON (JSON Lines) file as a result.

rem0g commented 7 months ago

How can it be streamed?

susanodd commented 7 months ago

It's the Django class StreamingHttpResponse. For the end user it's not really streamed, but it lets the results be fed into the result file over time. Before we used this, the file object would time out before being completed because generating the result took so long. So on your end it would just be waiting for the file to be available, as it is for the zip file generated by the package URL.
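The jsonlines idea can be sketched as a generator that yields one JSON line per processed file; in the view, this generator would be handed to Django's StreamingHttpResponse so rows are written as each import completes. `import_one` is a placeholder here, assumed to return a gloss id or raise LookupError.

```python
import json

def import_results_stream(files, import_one):
    """Yield one JSON Lines row per file as its import completes.  The
    view would wrap this generator in StreamingHttpResponse.  import_one
    is a hypothetical stand-in for the real per-gloss import."""
    for f in files:
        try:
            gloss_id = import_one(f)
            row = {f: {"gloss": gloss_id, "status": "Success", "errors": ""}}
        except LookupError as exc:
            row = {f: {"errors": str(exc)}}
        yield json.dumps(row) + "\n"
```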

susanodd commented 7 months ago

I'm going to try and see if this works with each json object row being the result of an atomic operation to save the gloss video, so that saving each video is completed as a transaction. Then it will just write the results to a json file.

susanodd commented 7 months ago

@rem0g do you have any preference for the fields of the json that is returned?

I have it so that for each file in the import_videos folder (as shown in the screenshots above, only a partial path is shown, starting with import_videos ....), the possible outputs are "errors" (no gloss was found, or the video has the wrong format), or on success the gloss ID and the video filename are included.

(The nesting is a bit inconsistent, because in some cases the gloss is not found.)

It looks like this at the moment (this is returned in a json file) for the above example:

{
  "import_videos_status": [
    {
      "import_videos/NGT/nld/GEMEEN-2755.mp4": {
        "errors": "Gloss not found for GEMEEN-2755"
      }
    },
    {
      "import_videos/NGT/nld/GEMEEN.mp4": {
        "gloss": "2755",
        "videfile": "GEMEEN.mp4"
      }
    },
    {
      "import_videos/NGT/nld/KLAUW-A.mp4": {
        "errors": "Gloss not found for KLAUW-A"
      }
    },
    {
      "import_videos/NGT/nld/GEVEN-E-3689.mp4": {
        "errors": "Gloss not found for GEVEN-E-3689"
      }
    },
    {
      "import_videos/NGT/nld/GEVEN-E.mp4": {
        "gloss": "36727",
        "videfile": "GEVEN-E.mp4"
      }
    },
    {
      "import_videos/NGT/nld/GELIJKWAARDIG.mp4": {
        "gloss": "3944",
        "videfile": "GELIJKWAARDIG.mp4"
      }
    },
    {
      "import_videos/NGT/nld/STOMP-B.mp4": {
        "errors": "Gloss not found for STOMP-B"
      }
    },
    {
      "import_videos/NGT/nld/AANKIJKEN-A-1.mp4": {
        "errors": "Gloss not found for AANKIJKEN-A-1"
      }
    },
    {
      "import_videos/NGT/nld/AANBIEDEN-A-3689.mp4": {
        "errors": "Gloss not found for AANBIEDEN-A-3689"
      }
    },
    {
      "import_videos/NGT/nld/BRAM-SCHOEMAKER-E.mp4": {
        "errors": "Gloss not found for BRAM-SCHOEMAKER-E"
      }
    },
    {
      "import_videos/NGT/nld/POMPOEN-A.mp4": {
        "errors": "Gloss not found for POMPOEN-A"
      }
    }
  ]
}
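A client consuming this payload could split successes from errors by inspecting each per-file dict. A small sketch (later examples in this thread add an empty "errors" string on success, so the check is on truthiness rather than key presence):

```python
def split_status(payload):
    """Split an import_videos_status payload into imported and failed
    paths.  Success entries may carry an empty "errors" string, so test
    truthiness, not just the presence of the key."""
    ok, failed = [], []
    for entry in payload["import_videos_status"]:
        for path, info in entry.items():
            (failed if info.get("errors") else ok).append(path)
    return ok, failed
```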
susanodd commented 7 months ago

In the above, for the actual "import", a link to the gloss video will be included, too. (It makes coding a mess because I can't keep running the same example once it completes.)

susanodd commented 7 months ago

Okay, after actually importing the videos, this is the json format for a successfully imported video:

{
  "import_videos/NGT/nld/GEMEEN.mp4": {
    "gloss": "2755",
    "videfile": "GEMEEN.mp4",
    "status": "Success (Video File Overwritten)"
  }
},
susanodd commented 7 months ago

Here's for the tstMH dataset, with videos as shown above:

{
  "import_videos_status": [
    {
      "import_videos/tstMH/nld/GEVEN-E-3689.mp4": {
        "errors": "Gloss not found for GEVEN-E-3689"
      }
    },
    {
      "import_videos/tstMH/nld/GEVEN-E.mp4": {
        "errors": "Gloss not found for GEVEN-E"
      }
    },
    {
      "import_videos/tstMH/nld/STOMP-B.mp4": {
        "gloss": "42090",
        "videfile": "STOMP-B.mp4",
        "Video": "http://localhost:8000//dictionary/protected_media/glossvideo/tstMH/ST/STOMP-42090.mp4",
        "status": "Success",
        "errors": ""
      }
    },
    {
      "import_videos/tstMH/nld/comments_ZIEK-B-529.mp4": {
        "errors": "Gloss not found for comments_ZIEK-B-529"
      }
    },
    {
      "import_videos/tstMH/nld/POMPOEN-A.mp4": {
        "gloss": "42087",
        "videfile": "POMPOEN-A.mp4",
        "Video": "http://localhost:8000//dictionary/protected_media/glossvideo/tstMH/PO/POMPOEN-42087_o1pOkHU.mp4",
        "status": "Success (Video File Overwritten)",
        "errors": ""
      }
    }
  ]
}

susanodd commented 7 months ago

Unfortunately for that last one, the "backup" video system is interfering with this. The suffix _o1pOkHU on the filename of the "Video" link should not be there. (It's referring to a backup video.) @Jetske is working on this in #1129

susanodd commented 7 months ago

I modified the code to remove the video files that do not match any gloss in the dataset. But they remain visible in the "zip archive" summary, so it's still possible to see them.

There is something wrong with the permissions for the dataset subfolders in glossvideo/acronym/XX where the videos are stored. The import fails if these somehow do not have write permission, and also if the existing videos do not have write permission, since they need to be moved to a different name. (They need to have group permission for the website, but they seem to have 644 permissions.)

Jetske commented 7 months ago

@susanodd the suffix _o1pOkHU gets added when it tries to save a file with a filename that already exists in that folder. So first the old file should be made into a backup file (or deleted), and then the new one can be saved just fine.
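One way to follow that suggestion: back up the existing file before moving the new one into place, so the storage layer never has to invent a random suffix. A minimal sketch with shutil/pathlib (hypothetical helper; the real code also manages GlossVideo versions):

```python
import shutil
from pathlib import Path

def place_video(src, dest, backup_suffix='.bak'):
    """Move an imported video into its gloss folder.  If the target name
    already exists, rename the old file to a backup first, so the storage
    backend never appends a random suffix to the new file.  Sketch only."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        shutil.move(str(dest), str(dest) + backup_suffix)
    shutil.move(str(src), str(dest))
```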

susanodd commented 7 months ago

It's the add_video method that is doing that. I noticed there is an if-else clause: the if branch does stuff with GlossVideo versions, and the else branch (for when the file is not an in-memory upload) just ignores this. So I'll try to fix that for the second case, since it's already a real file.

This is the if test (so it goes wrong when that is false):

if isinstance(videofile, File) or videofile.content_type == 'django.core.files.uploadedfile.InMemoryUploadedFile':

Jetske commented 7 months ago

When it fails that if test, it doesn't save the video, though. So in that case it shouldn't create a file with the suffix either.

susanodd commented 7 months ago

I added some print statements; it does take the first if-clause. But videofile.content_type causes a runtime error because the object doesn't have that field. I put in a print statement to see what the content type is, but that didn't work. Then I printed the object's dict, but that crashed after a few files.

The original code for importing videos from the import_videos folder circumvents the gloss.add_video method and just uses shutil to move the file.

I'm wondering if that needs to be adapted, because the File being moved is in a totally different location and can't be stored inside GlossVideo. (Its path goes to import_images/acronym/....) Also, the file needs to be deleted afterwards.

susanodd commented 7 months ago

I could disallow overwriting an existing file, and then figure out how to get the versioning to work. Those Django File objects are annoying because the path is inside them. Except if there exist GlossVideo entries without corresponding files, those need to be deleted. (So it needs some kind of test to see if there is anything already in GlossVideo for the gloss, plus a check whether the files exist, and if not, delete the entries from GlossVideo.)
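The stale-entry check described above can be sketched as a pure function over (id, path) pairs queried from the database; the function name is hypothetical, not actual model code.

```python
import os

def stale_video_entries(entries):
    """Given (glossvideo_id, file_path) pairs from the database, return
    the ids whose video file is missing on disk; those GlossVideo rows
    would be deleted before attaching a new video to the gloss."""
    return [vid for vid, path in entries if not os.path.isfile(path)]
```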

susanodd commented 7 months ago

UPLOADED UNZIPPED ZIP ARCHIVE READY TO IMPORT

display-videos-to-import-archive-before-import

AFTER IMPORT LISTING

imported-videos-listing-after-import

susanodd commented 7 months ago

This is on signbank-dev now.

There is a button to Manage Media in the Dataset Detail View (get there e.g., via the Available Datasets page, choose NGT)

Woseseltops commented 7 months ago

Just tried uploading @ocrasborn's zip, as requested by @susanodd. If I make it a flat folder (so the videos are at the highest level directly), I get a nice red bar telling me the structure is incorrect (but not what it should be; stating the expected structure would be an improvement, I think). If I make it so the videos are inside a folder called NGT, I get

image

Is the zip still incorrect? Let me know

susanodd commented 7 months ago

The zip file structure should be:

NGT/nld/glossannotation.mp4

So it should report that one as incorrect, too. :(

There should be another folder under NGT, the three-letter language code.

(I thought that is what @ocrasborn gave you? It's the same structure as the original command he used to use...)
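The required layout can be checked mechanically against each archive member. A sketch with the stdlib zipfile module, assuming the rule is exactly dataset acronym, then a three-letter language code, then the annotation mp4 (the helper name is made up; the real view also verifies the acronym against the chosen dataset):

```python
import re
import zipfile

# Required layout: dataset acronym / three-letter language code / file,
# e.g. NGT/nld/glossannotation.mp4
STRUCTURE = re.compile(r'^[^/]+/[a-z]{3}/[^/]+\.mp4$')

def bad_zip_entries(zip_file):
    """Return the archive members that do not follow the required
    dataset/<lang>/annotation.mp4 layout.  Illustrative check only."""
    with zipfile.ZipFile(zip_file) as zf:
        return [n for n in zf.namelist()
                if not n.endswith('/') and not STRUCTURE.match(n)]
```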

Woseseltops commented 7 months ago

  • Okay, with that file structure uploading works! Please put this example in the error message and everything should be fine :).

    • I do note that the videos do not show after it marks an import as successful... but perhaps that's related to settings on signbank-dev?

susanodd commented 7 months ago

Did you do the Import button?

@Woseseltops did it show anything? I looked at the Media Manager page again, because it keeps the zip files there. I checked two of the glosses. Gloss 24.B has a video now. But Gloss 32 did not import because that's a Lemma translation, not a Gloss annotation. (It needs the annotation text, not the lemma, to identify the gloss.)

I checked some of the videos, whether they exist:

imported-videos-NGT-BE

I'll check to see what it's doing. In the Zip upload section, it hasn't yet checked whether the gloss exists. In the "import_videos/NGT" part it has the Status and shows there if it didn't find the gloss. (With a red X.)

I revised the error reporting on the zip file structure. That's on signbank-dev now.

susanodd commented 7 months ago

It could be that the video had not yet finished being written to the file system?

I'll modify the Search form on Gloss List so if you search on Has Video, it checks that the file exists also. (It should do that anyway.)

susanodd commented 7 months ago

@Woseseltops I modified the Search so that when you search on "Has Video" it checks that the gloss also has a video file.

This helps on signbank-dev since you can find the videos you just imported.

The Search used to look only in GlossVideo, so all the "missing videos not on signbank-dev" ended up showing up in the search. Now it's easier to debug the code, since missing video files are not shown as Has Video results.
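The revised check can be sketched as: a gloss counts as "Has Video" only when both the GlossVideo row exists and its file is on disk. The dict lookup below stands in for the database query; names are illustrative.

```python
import os

def has_video(video_path_for_gloss, gloss_id):
    """'Has Video' means a GlossVideo row exists AND its file is on disk.
    video_path_for_gloss stands in for the database lookup; a sketch of
    the revised search check, not the actual form code."""
    path = video_path_for_gloss.get(gloss_id)
    return bool(path) and os.path.isfile(path)
```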

susanodd commented 7 months ago

The code still finds really old videos (locally, at least) where the video file exists but there was never an entry in GlossVideo. I suspect there are some really old glosses on signbank where this is true. Also, if the citation image is missing but there is a video, the gloss looks empty in the results. But you can create a citation image from the video.

(Should this kind of information be added somewhere? It's like "quirks" about video files. These are kind of meta-information about the system, not really about searching for glosses.)

susanodd commented 6 months ago

This is on signbank now.

susanodd commented 6 months ago

This is live on all signbanks.

ocrasborn commented 6 months ago

@susanodd, great that this is possible now! I would suggest also adding a button on the Dataset Management page, which sounds to me like a more natural location; it's related to managing the dataset, after all. Same for the 'Corpus Overview' button, really. @uklomp, what do you think?

susanodd commented 6 months ago

I agree, actually. But we're missing @vanlummelhuizen a bit here with the Dataset Management page. I first tried to add the buttons there, without success. (The buttons need to be caught by the Python view code. There are zillions of permission checks on this page! I was not able to add the buttons in a way that displays nicely. Probably we need to introduce panels or some other layout here.)

uklomp commented 2 months ago

I guess this is finished?

susanodd commented 2 months ago

I didn't implement the request from @ocrasborn about the video tags yet.

I see that issue was closed, so this issue can be closed now.