Woseseltops commented 1 month ago

@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:

Name video files after glosses, store them in a zip
Upload a zip file to a server somewhere and grab its url
Call /dictionary/upload_zipped_videos_folder_json/{datasetid}/ and provide this url
Call /dictionary/upload_videos_to_glosses/{datasetid} to actually link them to the glosses.

I want to replace this with just one step, to be run for each gloss invidually:

POST /dictionary/gloss/{gloss_id}/zippedvideo with just the zipped file in the payload.

I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.

Also, @rem0g suggests to remove the h264 check:

it's not needed to check whether it's h264. If it doesnt play, it wont play and we will know that one way or another.`

Woseseltops commented 1 month ago

@rem0g , would this make your life better?

@susanodd , can you think of other drawbacks?

susanodd commented 1 month ago

I agree the check for h264 is excessive. It was more a check that the user didn't put say, PDF files in with videos. But if we stick with "try-except" with friendly error reporting for when there was a problem, then this could be programmed away, so not to "check before" but just fail.

The zip uploads them to a different location because the normal upload of a video creates an object at the same time.

My opinion is that we should turn off reversion. (It will create a mess if it stays, because it does this every time you upload a video and then moves files around.) I would just remove it entirely and just overwrite an existing video file. Invest more time in saving the writable archive instead.

For the perspective videos, I made it so it just "overwrites" an existing 'left' or 'right' file and leaves the "GlossVideo object" alone. (So its link just points to a replaced file.)

Woseseltops commented 1 month ago

Summary of @susanodd : there are some extra things we should think about if we want to do this right.

rem0g commented 1 month ago

I would rather go this way:

Name video file after signbank ID (because of special chars of some glosses)
Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.
Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.
Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.
History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

This would definitely make my life (and for other developers easier) as we are going to upload max 6 videos per glos + 1 animation file.

susanodd commented 1 month ago

History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

This information is already maintained and is in the admin. We can make this visible via the dataset manager page. Operations related to this would need to be added. About the accidentally deleted videos, that is more difficult. Only the paths are stored in the objects, not the actual video. You can merely see what the operation was. Not undo a delete. That the video files are not stored in the database is evidenced by there not being any videos on the development servers, unless uploaded on them.

That's where "reversion" comes in. Reversion adds extra ".bak" to the end of filenames. And it's ".bak.bak.bak" extensions depending on how many deletes. Or it's version #, in the object model with 0, 1, 2, ... and those all get modified whenever a new video file is uploaded. (The code is inconsistent in doing this. Some of this is automated.) I don't like the code that does this at all. Moreover, if the user keeps retrying to upload a video, it keeps making more of these files...

Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.

Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Where it is uploaded to is a problem. Objects need to be created. At the moment there is also "reversion" as described above.

There is also a problem with transactions versus file system operations. If the file system hasn't finished writing an upload before the object is saved something goes wrong. If the server gets overloaded with lots of file system operations the transactions will fail. Sqlite locks the entire database for atomic operations.

Django writes to temp storage first. All the operations work this way. (Some other issues are having trouble with this.)

susanodd commented 1 month ago

[IMPLEMENTATION]

Maybe it would be possible to split the video upload into two parts, so the "gloss video object" and "gloss video history" objects are created, but only dummy things in the file system, so it can be done fast.

Then couple a cron job that puts all the actual files to replace the dummies. There could be a toggle field in the object to indicate whether it was a dummy or if its contents had been filled in.

susanodd commented 1 month ago

In #1329 the bug seems to be with symbolic links.

I fixed this. I rewrote the code to delete first then upload, instead of replace. There is a setting in default.py that turns off deleting video files. This needs to be turned on otherwise the code cannot delete any files even if the video is deleted in the admin. This remains the same for normal videos. I made new branches for NME videos and perspective videos since the code needs to delete the objects and their videos.

Hopefully this issue can remove the reversion code.

Woseseltops commented 1 month ago

A few responses to @rem0g :

Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.

For speed, but I wouldn't mind doing the raw video file instead.

Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal?

Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.

I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?

History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

rem0g commented 1 month ago

A few responses to @rem0g :

Upload the video file to server directly, I don't see good reason for zipping the file besides special chars.

For speed, but I wouldn't mind doing the raw video file instead.

Videos are preprocessed already, but zipping video files slows down and compilates the process. Signbank has to download the provided URL and then unpack it, this can all be skipped by using simple video upload via POST form.

Signbank will automatically bind the signbank video with gloss. There is no reason this could only be done with call, Python and Django are very well able to process this and it's alot easier to implement.

Can you rephrase this? I'm unsure what you mean exactly here... how is this different from my original proposal? We have this: url = "https://signbank.cls.ru.nl/dictionary/upload_zipped_videos_folder_json and then url = "https://signbank.cls.ru.nl/dictionary/upload_videos_to_glosses/5/"

That is exactly what your proposal is about, but it should be:

url = "https://signbank.cls.ru.nl/dictionary/upload_video/5/" + POST form

then the video is uploaded and binded at once, becasue it retrieves video file in binary + associated post forms.

Signbank will provide status in gloss query about the status of video (processing,succeed (at which date). This can be checked periodically than continue to upload videos and see what happens with binding forever; thus reducing need for resources and latency.

I might be mistaken, but I think after the initial upload there are no other processing steps, now that the only check has been removed? That is, it's either uploaded to a gloss, or the upload failed?

SB does post status of other glosses in the response form when i upload entirely different glos video.

History of uploaded videos, because there is no history what happens with video. We should be able to restore by accident deleted videos, this can happen without API and will be very welcomed by other users.

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

Ok

susanodd commented 1 month ago

Django uploads files to a temporary storage location before they are put in the proper locations for the dataset/glosses. There is always this step.

susanodd commented 1 month ago

Because reversion is being used, Django potentially moves a bunch of video files whenever a new video is uploaded. It also creates objects for each upload A video upload is never just one operation.

susanodd commented 1 month ago

@Woseseltops there are two branches/pull requests now with migrations on the underlying video models. Can you checkout this code in conjunction with this issue? #1328, #1338

susanodd commented 1 month ago

For restoring videos from the video history a better backup system is needed. We could do the same as for gloss delete/restore and add the archived field. [NOPE, that still depends on the file system underneath.]

My opinion is that reversion needs to be removed. It makes a mess of stuff. Even if we try to program around it, it is still there sometimes. The video history has the operations but there are no videos in it. And if you keep doing operations they keep showing up in the history because it also has unsuccessful (historical) operations stored there. (That is something to do with the file system, because if something failed the objects have already been stored in the database, but perhaps not the file system, and vice verse. Django does not wait on the file system. If there is a solution to this, specific to Django?) (In the ajax calls with videos, the "user html that has no Django in it" needs to "explicitly buffer the file" before it can do the PUSH request. Otherwise the file is empty on the Django side.) That's why I did a zip file with multiple videos in the Manage Media page. Then the user also sees the contents of their zip file and sees the videos being processed. I set a limit to the "process next 10 videos" because the server could not keep up making objects for the videos!!!!! Those video files are already "on" the Signbank file system. It's just video objects that are being created.) We also ought to avoid unnecessary burden of the server with extreme amounts of url requests. (opinion) Avoid thrashing.

Woseseltops commented 1 month ago

There's some confusion around this issue; I'm still not 100% sure @rem0g and I are on the same page, @susanodd isn't sure about traffic... so in an attempt to remove this confusion, I'm going to build a first draft for this and see what everybody thinks.

rem0g commented 1 month ago

That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.

susanodd commented 1 month ago

That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.

The zip allows the special characters to remain inside the zip because you can give a simple ascii-based filename. Signbank has video files with Chinese characters in them, for example.

In #1341 you have glosses with special characters in them. We are working on locating the bugs.

rem0g commented 1 month ago

No, every video should be named with signbank id to prevent issues with gloss renaming.

Gomer

On Fri, Oct 11, 2024 at 06:42 susanodd @.***> wrote:

That's fine, but it's our requirement to upload video via POST form instead of zipping it. I still fail to see any reason to zip the video before uploading it, sorry.

The zip allows the special characters to remain inside the zip because you can give a simple ascii-based filename. Signbank has video files with Chinese characters in them, for example.

— Reply to this email directly, view it on GitHub https://github.com/Signbank/Global-signbank/issues/1332#issuecomment-2406543328, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACC4F7XT2CIWPSIPOUF6R3DZ25JKFAVCNFSM6AAAAABO5E6FQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBWGU2DGMZSHA . You are receiving this because you were mentioned.Message ID: @.***>

susanodd commented 1 month ago

No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer …

Yes, of course. But the zip file itself is not the same as the video file inside that includes the annotation in the filename.

The intent of using a zip file is to shield the "information exchange over internet" from the special characters. Some of the (browser) code escapes the urls by putting "%" codes in them. The browsers do this. Inside the zip file, the actual video files can have any name. But in the past, some really old original signbank video code actually used those "%" in the filenames, then it won't match the annotation, obviously. (@Woseseltops wrote a lot of the original code, so hopefully he can investigate.)

Of course, we still need to figure out what is wrong with #1341.

susanodd commented 1 month ago

There probably needs to be totally different new code for the API videos. Some of the original code still is trying to convert to mp4. (A function ensure_mp4 still exists. But most of the called functions have commented this out because it was broken. The code to make small videos is also commented out. The code to generate a "still" for the video has also been commented out in some places because there was a bug in it if the "length" of the video ends up changing when it is converted to frames and resized.) All of this code is or is not necessarily called during video upload. @Jetske and @susanodd have introduced new code to try to circumvent some old code. But the old code has not been removed. Most of it is to do with the things mentioned.)

I report here all the buggy things that are messed up with videos. It's not always clear when code is being run or which code is being run.

rem0g commented 1 month ago

I agree the video upload needs a complete rewrite, ideally based on signbank ID as video name. This also goes for NMM and Voorbeeldzinnen videos, they all need ID's.

That way we have a better idea of what's going on instead of investigating problems in old codes which needs more people.

Woseseltops commented 1 month ago

Okay I got the external part working, I can now upload video to a Signbank server like this:

# The URL where the video file will be sent
url = 'http://35.159.203.216/dictionary/gloss/12/video'

# The path to your video file
video_file_path = 'glossvideo_ASL_FO_FORCE-834.mov'

# Auth
headers = {'Authorization': f'Bearer {BEARER_TOKEN}'}

# Open the video file in binary mode and send it in a POST request
with open(video_file_path, 'rb') as video_file:
    files = {'file': video_file}
    response = requests.post(url, files=files, headers=headers)

# Print the server's response
print(response.status_code)
print(response.text)

The internal processing of the file will follow probably next week.

Let me know what I can do to make it more convenient for you @rem0g ! From what you write, I'm thinking perhaps a generic video upload end point, so you're not forced to put the numeric ID of the gloss in the url?

susanodd commented 1 month ago

@Woseseltops can you take a look at the other issues about the videos? #1331 #1341 #673

The problem is that the database gets locked from the traffic.

susanodd commented 1 month ago

No, every video should be named with signbank id to prevent issues with gloss renaming. Gomer …

I will make this be an option for the existing zip archive so you can use merely the ID as the filename. (The Manage Media page also uses the same format for the zip archive.)

You ought to put like 20 or 50 videos in the zip file. Not just one video. This will reduce traffic to the server.

The code makes a streamed-json result to include all the results for all the videos that are in the same zip file.

Based on the ID, it is already possible to just lookup the gloss to retrieve its default annotation and its lemma (first two characters) in order to construct the path. @ocrasborn came up with using the annotation (without the ID) because normally the (non-API) user is not using the IDs.

susanodd commented 1 month ago

I posted many excerpts from the logs in the other issues. You can see that the database is getting locked because of the traffic. My opinion is still the same that the zip archive (with many videos) is better because it's less traffic.

We can modify the existing zip file upload request to put the file inside as a POST instead of as a URL parameter. (I need help with that POST part.)

susanodd commented 1 month ago

https://github.com/Signbank/Global-signbank/issues/1331#issuecomment-2410156648

susanodd commented 1 month ago

For applications handling substantial traffic, which translates to multiple workers across various server instances (depending on the scaling strategy adopted), concurrency inevitably needs to be managed.

How does it play out in Django? Well, when you use the inbuilt ORM to call the save() method on an object, there is a good chance that two different instances of the server call this method on the same object at the same time, causing the data to get corrupted.

How can we resolve this? Quite simple. A pessimistic approach dictates that you should lock the resource in question exclusively until you are finished with it.

When a database operation is in progress, the object or the set of updated objects must be locked until the operation is complete so that no other process can access this object. This will prevent multiple server instances from loading stale data into memory and corrupting the database.

rem0g commented 1 month ago

Okay I got the external part working, I can now upload video to a Signbank server like this:
# The URL where the video file will be sent
url = 'http://35.159.203.216/dictionary/gloss/12/video'

# The path to your video file
video_file_path = 'glossvideo_ASL_FO_FORCE-834.mov'

# Auth
headers = {'Authorization': f'Bearer {BEARER_TOKEN}'}

# Open the video file in binary mode and send it in a POST request
with open(video_file_path, 'rb') as video_file:
    files = {'file': video_file}
    response = requests.post(url, files=files, headers=headers)

# Print the server's response
print(response.status_code)
print(response.text)
The internal processing of the file will follow probably next week.

Let me know what I can do to make it more convenient for you @rem0g ! From what you write, I'm thinking perhaps a generic video upload end point, so you're not forced to put the numeric ID of the gloss in the url?

Yes perfect, will there also be callback if the video upload has succeed and processed in the database directly from the URL? If that will cause a long wait time I can adjust my scripts.

Jetske commented 1 month ago

@rem0g I added api creation with video upload of voorbeeldzinnen here: https://signbank.github.io/Global-signbank/#/Adding%20Signbank%20data/post_dictionary_api_create_annotated_sentence__datasetid__ Can you see if this way of uploading files works for you? I think it would be good to at least handle all kinds of api video uploads in a consistent way.

susanodd commented 1 month ago

What @Woseseltops proposes is that the API uses a file name is the path to the file destination in the Signbank file system, using underscore where slash is.

1347 (there are actually bugs in creating folders if they don't exist yet)

susanodd commented 1 month ago

@rem0g suggested (in #1331) we redesign the video upload. I think the current workflow is:

Name video files after glosses, store them in a zip

Upload a zip file to a server somewhere and grab its url

Call /dictionary/upload_zipped_videos_folder_json/{datasetid}/ and provide this url

Call /dictionary/upload_videos_to_glosses/{datasetid} to actually link them to the glosses.

I want to replace this with just one step, to be run for each gloss invidually:

POST /dictionary/gloss/{gloss_id}/zippedvideo with just the zipped file in the payload.

I don't think uploading the videos one by one has any real drawbacks in terms of overload, and it does make the whole process easier to follow: you upload a video to a gloss, and if it fails somewhere in the process, it can just return what went wrong for that one gloss.

Also, @rem0g suggests to remove the h264 check:

it's not needed to check whether it's h264. If it doesnt play, it wont play and we will know that one way or another.`

@Woseseltops the zip file contains an archive of many videos. Not just one video. It was intended to reduce traffic. For example, if you upload a zip file with 20 videos, it only needs two url requests, not 40.

I am working on making the zip archive inside the POST instead of as a URL. The original method gets the zip file from a URL because at the time, I had no idea how to put a zip file inside the POST. Tim helped me do it as a URL because you were on vacation then.

Woseseltops commented 1 month ago

Response to @rem0g :

Yes perfect

I'm not sure if you're responding to my Python example or my (alternative) proposal below it, but to make sure I don't slow things down even further I have finished the end point as in the example and put it up for review. Let me know if you're also interested in having a generic endpoint that doesn't force you to put the gloss ID in the URL!

will there also be callback if the video upload has succeed and processed in the database directly from the URL?

This is example output from using the script above, does that answer your question?

400
{"error": "No file uploaded"}

200
{"message": "Uploaded video of size 156919 bytes to dataset NGT."}

If that will cause a long wait time I can adjust my scripts.

In my test cases it is relatively quick! There are worries on our side that adding this end point might generate a lot of parallel requests, which will lock our database (which was not designed for mass usage). We could implement rate limiting, but instead it's probably easier to ask you to put a few seconds between each request if possible :)

rem0g commented 1 month ago

Yes perfect. I will take a look at it next week.

On Thu, Oct 17, 2024 at 14:35 Wessel Stoop @.***> wrote:

Response to @rem0g https://github.com/rem0g :

Yes perfect

I'm not sure if you're responding to my Python example or my (alternative) proposal below it, but to make sure I don't slow things down even further I have finished the end point as in the example and put it up for review. Let me know if you're also interested in having a generic endpoint that doesn't force you to put the gloss ID in the URL!

will there also be callback if the video upload has succeed and processed in the database directly from the URL?

This is example output from using the script above, does that answer your question?

400 {"error": "No file uploaded"}

200 {"message": "Uploaded video of size 156919 bytes to dataset NGT."}

If that will cause a long wait time I can adjust my scripts.

In my test cases it is relatively quick! There are worries on our side that adding this end point might generate a lot of parallel requests, which will lock our database (which was not designed for mass usage). We could implement rate limiting, but instead it's probably easier to ask you to put a few seconds between each request if possible :)

— Reply to this email directly, view it on GitHub https://github.com/Signbank/Global-signbank/issues/1332#issuecomment-2419423187, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACC4F7UIRIQLJPRKGIVGU2DZ36VIZAVCNFSM6AAAAABO5E6FQSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMJZGQZDGMJYG4 . You are receiving this because you were mentioned.Message ID: @.***>

Woseseltops commented 5 days ago

Okay I think the only remaining item here is this idea

How about an endpoint GET dictionary/{gloss_id}/video_history, which returns a list of upload/delete dates for videos and ids to refer to them, and POST dictionary/{gloss_id}/restore_video with an id to restore a video uploaded earlier?

I'm not sure what we have in terms of video history at all, so perhaps before I start: @rem0g , given approaching deadlines, do you still think it would make sense for me to work on this?

susanodd commented 5 days ago

You can browse the history in the admin to see whether it is useful.
When there were repeated attempts in the past because the video upload wasn't working , there were 74 entries for one of the DISNEY glosses.
Some filtering would be needed

Signbank / Global-signbank

Redo the API video upload #1332

1347 (there are actually bugs in creating folders if they don't exist yet)