Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Annotation workflow v2 #19

Closed: josh-chamberlain closed this issue 6 months ago

josh-chamberlain commented 1 year ago

Context

v1 Existing doccano instance

Our volunteer @nfmcclure made us a doccano instance, which helped us label hundreds of URLs. This is an update to that original code, or a fresh start if needed. We need to label more data sources, and the next version of the pipeline needs to be more user-friendly, answering as many volunteer questions up front as possible.

Doccano instance: http://35.90.222.49:8000/projects/1 Comment or DM for access.

Doccano alternatives

Since we're starting fresh, we should probably use something like labelstudio. It's more fully featured and supports labeling rendered HTML, not just plain text. This could really help us label a wide variety of things.

Requirements

Docs

mbodeantor commented 1 year ago

We should probably own user creation so we have some visibility into who's making changes.

nfmcclure commented 1 year ago

I'll try to find some time soon (probably this week or next) to help migrate the instance over to digital ocean.

josh-chamberlain commented 1 year ago

thanks, @nfmcclure ! Much appreciated. Our pattern for other apps is to have a github repo on our org, set to auto-deploy over on DO. I'm guessing you won't have the permissions you need, and PDAP staff are happy to pick up where you get stuck.

josh-chamberlain commented 11 months ago

@nfmcclure a nudge, any update on the source code? we're happy to do any migration work, if that would help.

maxachis commented 8 months ago

@josh-chamberlain I'm willing to help tackle this. There are a few things I'll need in order to fully complete the task.

maxachis commented 8 months ago

As I understand it, this annotation workflow will include several components:

maxachis commented 8 months ago

Began working on Label Studio. Here is a very rough version of what the labeling process could look like. I gave it the URL and the HTML content; it rendered that HTML content and displayed it along with the URL. I then gave a few sample options (not referencing the Taxonomy) to indicate how these choices might play out. Depending on the number of options, I imagine it would be trivial to configure the option selection differently.

Label Studio can be configured to accept Source (Input) and Target (Output) Databases. Unfortunately, the non-local file options available for both are rather limited. I would need to know what our preferred method of storage would be, and what information I would need in order to properly load preprocessed data into the Source Database


The process for setting up Label Studio on Digital Ocean seems fairly simple.

Additional, lower priority tasks include:

My Next Task

Current Blockers (cc: @josh-chamberlain)

maxachis commented 8 months ago

#38 is a draft pull request for the preprocessing pipeline, which would be included in the data source identification repository. Being separate from the rest of the logic in that repository, it could be easily moved elsewhere. However, because it's associated with this issue within the repo, I'm keeping it here for now.

The pipeline is relatively simple -- it loads the data from the relevant source, harvests the HTML for each URL and pairs it with that URL, and then uploads that data to the relevant target. The source and target classes are designed to be easily substituted for the actual classes; they currently exist as placeholders and to guide the final implementation.
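To make that concrete, here is a minimal sketch of the shape of that pipeline. The SourceDatabase and TargetDatabase classes below are hypothetical placeholders standing in for whatever storage we choose (per the blockers above); they are not the actual classes in the draft PR:

import requests

class SourceDatabase:
    """Hypothetical placeholder for wherever the candidate URLs live."""
    def get_urls(self) -> list[str]:
        raise NotImplementedError

class TargetDatabase:
    """Hypothetical placeholder for wherever the (URL, HTML) pairs should land."""
    def upload(self, records: list[dict]) -> None:
        raise NotImplementedError

def harvest_html(url: str) -> str:
    # Fetch the raw HTML for a single URL.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def run_pipeline(source: SourceDatabase, target: TargetDatabase) -> None:
    # Load URLs from the source, pair each with its HTML, and upload to the target.
    records = []
    for url in source.get_urls():
        try:
            records.append({"url": url, "html": harvest_html(url)})
        except requests.RequestException:
            continue  # skip unreachable URLs; a real pipeline would log these
    target.upload(records)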

Once I obtain the information about the source and target databases discussed in the blockers above, I will be able to complete this and submit it as a full pull request.

My Next Tasks

josh-chamberlain commented 8 months ago

@maxachis yes, setting up in DigitalOcean should be pretty simple! In general we'll get things merged into a GitHub repo and working locally across our different machines, then Marty will push to a DO droplet.

I think it's up to @mbodeantor how to set up the target and source databases—I'm open to discussing this tomorrow if needed.

mbodeantor commented 8 months ago

@maxachis @josh-chamberlain Yeah, I would like to discuss; I'm unclear on the use case for Label Studio vs. Doccano.

josh-chamberlain commented 8 months ago

@mbodeantor either way, we'll need to start from scratch—we can't get hold of the source code for Doccano, though from what I understand we had a pretty vanilla implementation.

The key difference is that Labelstudio allows for labeling richer content like embedded images / rendered html, whereas Doccano is text-only. Labelstudio also seems to have a more mainstream user base, better documentation, etc.

maxachis commented 8 months ago

label_studio_config.zip: This file contains the HTML for the Expert Instruction, a JSON for the data types used, and JSON and XML template versions of the label config. In theory, this should be the primary material we need to quickly set up Label Studio on Digital Ocean.

josh-chamberlain commented 8 months ago

@maxachis nice, let's wait for @mbodeantor to make sure he's on board with using Label Studio.

maxachis commented 8 months ago

Currently working on setting up a Droplet on Digital Ocean. Outstanding tasks:

maxachis commented 8 months ago

A Label Studio instance is online, of sorts. It can be accessed at http://167.71.177.131:8080

Hitting a possible blocker here in terms of Role-Based Access Control (RBAC) -- specifically, the free version of Label Studio doesn't have it. That means that anyone who accesses our Label Studio instance can, if they are unscrupulous, change any settings they want in the project.

Now, if we're comfortable with the honor system, that's not a problem. If we're not, that would require some restructuring.

The enterprise version of Label Studio would avert this problem, and would allow other functionality as well. However, the pricing for Enterprise is unknown -- we'd have to contact sales about it.

@josh-chamberlain My current question is as follows:

josh-chamberlain commented 8 months ago

@maxachis interesting. I reached out to them for pricing info.

Can we limit access to the entire instance, to at least control who can sign up? Allowing anyone to sign up is scary. If we have some control over who can sign up, that's different. If people have to come to us to get an account, I am comfortable with the honor system for now—provided we periodically extract labeled content in case someone messes up / sabotages us. This worked just fine with Doccano.

I'm strongly against creating our own annotation pipeline from scratch. We can find a different free/cheap one which has RBAC if necessary...we are not the first to face this problem.

maxachis commented 8 months ago

There might be ways to limit access to the entire instance, but that would probably require us to implement things that go beyond the capabilities of free Label Studio. One option could be dockerizing our particular configuration of Label Studio and then handing that out to volunteers. If we configure it to point to cloud target and source databases, then in theory they would just need to spin it up and get started. But this wouldn't stop them from sharing it around or enable us to track their activity. And it does add an extra step (or several) to getting someone to help with annotation.

My personal opinion is that if we're willing to go to such lengths, then considering another annotation tool would probably be a more effective use of the effort.

maxachis commented 8 months ago

Additionally, in case this affects the decision on whether to use Label Studio, I will point out that the feature of displaying HTML would probably run into some problems if we tried to display the HTML of the web pages we're looking at. Since the HTML of these pages sometimes relies on relative addressing, rendered HTML content, absent the context of the web server it comes from, might appear broken. Not always, but sometimes.

josh-chamberlain commented 8 months ago

@maxachis let's keep it simple and just try to annotate the text then: the URL plus the meta and header content being collected by the tag collector. We can save displaying the page content for a future enhancement.
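For illustration, a single task built from that text-only content might look something like the dict below; the field names ("url", "html_title", "meta_description", "h1") are assumptions for the sketch, not the tag collector's actual schema:

example_task = {
    "data": {
        "url": "https://example-county.gov/police/records",          # the URL itself
        "html_title": "Public Records Request | Example County PD",  # from the <title> tag
        "meta_description": "Request incident and arrest records.",  # from a meta tag
        "h1": "Public Records",                                       # first header on the page
    }
}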

josh-chamberlain commented 8 months ago

regarding auth/users, I do have a call with labelstudio tomorrow to find out about enterprise pricing. I suggest:

josh-chamberlain commented 7 months ago

We have a labelstudio 2 week trial. Some things I'd like to test:

maxachis commented 7 months ago
  • [ ] annotation experience from perspective of different roles

@josh-chamberlain What is the full suite of roles we're envisioning here? Based on the interface, I can see two right off the bat:

  1. An individual assigning labels without any pre-existing labels
  2. Reviewers accepting/rejecting labels that already exist.

Any others I might be missing?

maxachis commented 7 months ago
  • [ ] assignment + correction process for annotations

I've created a simple URL taxonomy labeling task based on my original design, which anyone can try out easily enough. This is accessible on the project website as "URL Labeling and Annotation".

  • [ ] instead of labeling from scratch, use ML-generated suggestions for users to accept/reject

The relevant documentation on pre-generated predictions can be found here. The video example provided shows someone modifying a JSON file with a pre-existing prediction, which is then displayed in the annotation task as a pre-selected option. It does not show someone being able to accept or reject an option as though the annotation task has already been performed.

Thus, I have a few questions which I will investigate but which are also worth asking the Label Studio team during the free trial check-in.

I will note that if the answer to 1 is that we can't skip directly from the annotation task to the review/accept/reject portion, we could nonetheless build a workaround -- for example, by displaying a URL, the predicted classification, and a binary Approve/Reject option. And, if I'm understanding things correctly, we can bypass the review process so that these pseudo-review annotations are not reviewed a second time. This would allow us to create a workaround for 2 as well -- in this case, the pre-annotated data is treated as contextual information, and the label is the manual "Approve/Reject".
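As a sketch of that workaround, the labeling config and a task might look roughly like the following; the field and control names here ("url", "predicted_label", "verdict") are assumptions for illustration, not a finalized design:

pseudo_review_config = """
<View>
  <Text name="url" value="$url"/>
  <Text name="predicted_label" value="$predicted_label"/>
  <Choices name="verdict" toName="url" choice="single">
    <Choice value="Approve"/>
    <Choice value="Reject"/>
  </Choices>
</View>
"""

pseudo_review_task = {
    "data": {
        "url": "https://example-county.gov/police/daily-blotter",
        "predicted_label": "Incident Reports",  # prediction shown only as context
    }
}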

maxachis commented 7 months ago

Another point of interest is being able to integrate a machine learning backend with Label Studio and create an automated active learning loop. This could synchronize well with #41, "Make training happen on digital ocean". I would need to investigate the implementation further, however. And I may benefit from @EvilDrPurple's and @mbodeantor's insight into the machine learning pipeline and how easily we could integrate it into a Digital Ocean/Label Studio union.

I'm currently playing with their ML Loop example, and y'all can follow along with my forked version of the repo here if you're curious.

josh-chamberlain commented 7 months ago

@maxachis re: roles, 3. someone getting data into/out of label studio, or otherwise integrating with hugging face or the API

maxachis commented 7 months ago

Findings on Annotator and Reviewer Roles

cc: @josh-chamberlain

Administrators can create a project and indicate:

• that annotations are assigned manually, OR
• that annotations are automatically assigned to whomever next performs an annotation.

Assigning manual reviewers is fairly easy and intuitive. If I log in as an annotator, I only see the projects I'm added to. Thus, for annotations, several steps need to be completed for access:

  1. Project needs to be created (obviously)
  2. Project needs to be in a non-Sandbox workspace (somewhat less obviously)
  3. Project, after being created and moved into a non-Sandbox workspace, must be published (even less obvious)
  4. The user needs to be invited to the space and assigned as annotator
  5. The user needs to be added to the given project
  6. If the project involves manual annotations, the user needs to be manually assigned those annotations.

User experience as an Annotator is very user-friendly. Simple as click and go.


• Annotators can submit or skip.
• Annotators can revisit tasks they previously performed.

The experience is similar for reviewers. However, depending on the settings, reviewers can also annotate, so pay attention to the settings.

I’d additionally note that the project dashboard provides useful information, such as how long it takes people to complete a task. I recommend looking at that more closely.

maxachis commented 7 months ago

@maxachis re: roles, 3. someone getting data into/out of label studio, or otherwise integrating with hugging face or the API

I'll look into this next. As I said before, there is the option for machine learning integration, but there also appear to be simpler options that can involve either manual import/export of data, or else hooking it up to cloud-based storage options such as Amazon S3.

maxachis commented 7 months ago

Note that the cloud storage options available (for both the Source and Target databases) are limited to:

maxachis commented 7 months ago

I've been able to set up a source data pipeline that can automatically pull in data for a particular project. A few observations:

maxachis commented 7 months ago

I'll additionally point out that Label Studio has an API which seems like it could be useful, albeit with some limitations:

https://labelstud.io/api

This might make components such as setting up users, linking to specific projects, and so forth easier.

UPDATE: Removed portion expressing uncertainty about whether we can directly assign roles to users via the API -- I have confirmed that we can.

mbodeantor commented 7 months ago

Looks like we can import data through the API: https://labelstud.io/api#tag/Import/operation/api_projects_import_create

maxachis commented 7 months ago

We can also export the data similarly through the API. These would probably be the better options to take, as opposed to hooking them up to cloud providers; it helps keep things more flexible.
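For reference, a minimal export call might look like the sketch below, assuming the project export endpoint with an exportType=JSON query parameter (per the API docs) and placeholder values for the token and project ID:

import requests

API_KEY = "your_api_key"                  # per-user token from the account settings page
BASE_URL = "http://167.71.177.131:8080"   # our Label Studio instance
PROJECT_ID = 1                            # placeholder project ID

# Pull all completed annotations for a project as JSON.
response = requests.get(
    f"{BASE_URL}/api/projects/{PROJECT_ID}/export",
    headers={"Authorization": f"Token {API_KEY}"},
    params={"exportType": "JSON"},
)
response.raise_for_status()
tasks = response.json()
print(f"Exported {len(tasks)} tasks")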

maxachis commented 7 months ago

@josh-chamberlain @mbodeantor I have created a draft pull request at #47 that can serve as a proof-of-concept for demonstrating how to transfer data into and out of the project, utilizing the API. If we wish to go forward with Label Studio, this can be used as a starting point for further modifications.

I'll next work on modifying the data to test importing pre-annotated data, to simulate what could be done with a machine learning pipeline.

maxachis commented 7 months ago

Observations from trial

  1. Label Studio's user interface is quite intuitive and easy to use. For the annotator/reviewer, I anticipate the process would be quite smooth.
  2. The API is quite powerful, capable of migrating data in and out of Label Studio, updating user roles, and a whole host of other actions. The documentation, similarly, is (mostly) quite clear and useful. However, many of the API actions are atomic -- more complex actions, such as rotating user roles to ensure we stay below our seat limit, will require substantial backend work to chain together multiple API actions.
  3. Configuration of the projects and tasks is the primary pain point -- the XML templates for setting up a project are useful, but not optimally documented, and errors are not always clear. Similarly, data must be precisely configured, as errors in the data can cause errors in the application which are not always easy to identify and diagnose. This issue will be most present during task setup but seemingly won't cause too many issues afterwards. We do, however, need to be certain that the task is properly configured and all task data going into it is properly set up.

maxachis commented 7 months ago

On the active learning functionality and deeper ML integration

It is interesting, and I think we could benefit from utilizing it. By selecting only the samples our machine learning model is most uncertain about, we could solve the problem we've had of certain training data being underrepresented. However, that process would require a more complicated setup, and probably would benefit from having an active learning setup already developed. Thus, it might not be useful to explore right now, given the limited amount of time we have on this trial.

Additionally, the documentation for the machine learning portion is lacking and in some cases appears to contain contradictions. For example: model.py in the GitHub repository for the Label Studio ML Backend defines two methods, fit() and predict(), whose parameters are not the same as those in the dummy model example, even though the dummy model indicates that the class defined in model.py is its parent class. Since these two methods are apparently the means by which the machine learning backend would interface with Label Studio, it concerns me that I don't know which of these two sources to treat as authoritative.
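For context, the rough shape of such a backend is a subclass of LabelStudioMLBase that overrides those two methods. The sketch below loosely follows the dummy model's style, with the caveat that the exact signatures are precisely what's unclear between the two sources; classify() and the control/tag names are hypothetical stand-ins for our actual model and labeling config:

from label_studio_ml.model import LabelStudioMLBase

class URLClassifierBackend(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        # Return one prediction per task, in Label Studio's result format.
        predictions = []
        for task in tasks:
            label, score = self.classify(task["data"]["url"])
            predictions.append({
                "result": [{
                    "from_name": "label",  # must match the Choices name in the labeling config
                    "to_name": "url",      # must match the Text tag name in the labeling config
                    "type": "choices",
                    "value": {"choices": [label]},
                }],
                "score": score,
            })
        return predictions

    def fit(self, *args, **kwargs):
        # Retrain on newly submitted annotations; omitted in this sketch.
        pass

    def classify(self, url):
        # Hypothetical stand-in for the actual ML pipeline's classifier.
        return "Incident Reports", 0.5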

maxachis commented 7 months ago

Questions for Label Studio Team

I'll update this comment with additional questions as I progress:

maxachis commented 7 months ago

Creating/Rotating Users

Can be done using the API. I've linked to the relevant commands.

I updated my PR to include functionality for updating a member's role, as well as an integration test demonstrating this.

josh-chamberlain commented 7 months ago

We got some responses:

It was nice meeting you today. To follow up on our conversation today, here are the answers to your questions:

  1. Your assumptions were correct here. (the dummy model is out of date)
  2. You can achieve this by inserting the pre-annotations in the "annotations" key in the json file format with the user information, if needed. Here is the doc for example. And you can leave the "prediction" key empty.
  3. Correct, pre-annotations are only supported in JSON format, at the moment, we do not support CSVs for pre-annotations
  4. I have attached in this email a json file sample where you can load pre-annotations as submitted annotations. Keep in mind, if you omit the annotator info, the software will automatically show the user that created the tasks as the annotator. Here is a sample python code for the API call to create the tasks from the json:
import requests
import json

def load_json(input_file):
    with open(input_file) as f:
        json_file = json.load(f)
    return json_file

api_key = "your_api_key"  # personal access token for the Label Studio account

project_id = 00000  # replace with the target project's ID

url = f"https://app.heartex.com/api/projects/{project_id}/import"

# or just load the json directly from the script as a dict
data = load_json("json_imports/fish_anno.json")

headers = {
    "Authorization": f"Token {api_key}",
    "Content-Type" : "application/json"
}

response = requests.post(
    url=url,
    headers=headers,
    json=data
)

print(f"Response code: {response.status_code}, Response body: {response.text}")

fish_anno.json

maxachis commented 7 months ago

I'm working on creating code that can convert our data into the requisite format for pre-annotations.

Bear in mind, the data must be in a very precise format, which is not always optimally documented.

I may also need @EvilDrPurple's insight as to how the label data output by the ML pipeline is currently represented, as I will need to know how to convert data from that format into Label Studio's bespoke format.
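As a rough sketch (assuming the ML pipeline gives us simple (url, predicted_label) pairs, and that "label" and "url" match the control and tag names in our labeling config), the conversion could look something like this:

import json

def to_preannotated_tasks(rows):
    # Convert (url, predicted_label) pairs into Label Studio tasks, storing each
    # prediction under the "annotations" key so it lands directly in the review queue.
    tasks = []
    for url, predicted_label in rows:
        tasks.append({
            "data": {"url": url},
            "annotations": [{
                "result": [{
                    "from_name": "label",  # Choices control name in the labeling config
                    "to_name": "url",      # Text tag name in the labeling config
                    "type": "choices",
                    "value": {"choices": [predicted_label]},
                }],
            }],
        })
    return tasks

rows = [("https://example-county.gov/police/records", "Incident Reports")]
with open("preannotated_tasks.json", "w") as f:
    json.dump(to_preannotated_tasks(rows), f, indent=2)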

maxachis commented 7 months ago

Can confirm I've successfully been able to import pre-annotated data into Label Studio, which can then be reviewed directly, bypassing the annotation stage. In other words, we can create a full pipeline with either unannotated or pre-annotated data.

My next priority will be to create an example pipeline, using fake data, that people can run to illustrate how the workflow would work.

I'll be putting aside a demonstration of the programmatic user rotation functionality (which we'd need to decide whether we want to pursue), as well as the active machine learning loop, neither of which is part of the Minimum Viable Product for this issue.

maxachis commented 7 months ago

I have created and linked #47, a draft PR that at the moment mainly exists to demonstrate the functionality of Label Studio and how it would look to utilize it in a (simplified) pipeline.

@josh-chamberlain @mbodeantor I invite y'all to check it out and see

  1. If you can run basic_demonstration.py successfully
  2. If the workflow makes sense and is what you'd expect.

maxachis commented 7 months ago

@josh-chamberlain Since it's been over a week, I wanted to additionally ping you on this, in case it got lost in the shuffle.

josh-chamberlain commented 7 months ago

@maxachis sorry about the delay, I wasn't getting notifications. I'm looking at this now.

josh-chamberlain commented 6 months ago

@maxachis I made a project called Labeling interface

Aside from the fact that the actual text we're labeling will look different, this is what I was expecting the process to look like. What do you think about making this canonical, and calling this issue closed? If the project is set up, we just need to hit it with tasks.

It's easy enough to make these 3 separate labeling tasks—but I think it's better if each URL only goes through the pipeline once, because it takes time for someone to read and understand what they're looking at.

screencast 2024-04-03 14-29-03

maxachis commented 6 months ago

@maxachis I made a project called Labeling interface

Aside from the fact that the actual text we're labeling will look different, this is what I was expecting the process to look like. What do you think about making this canonical, and calling this issue closed? If the project is set up, we just need to hit it with tasks.

It's easy enough to make these 3 separate labeling tasks—but I think it's better if each URL only goes through the pipeline once, because it takes time for someone to read and understand what they're looking at.

@josh-chamberlain This interface is wayyy better looking and comprehensible than what I came up with, so no complaints there.

I also have no issue with having this be one task, and in fact I think it's probably considerably easier for the user that way as well.

I'm also happy to close this issue. I think after this we'd just need to create one or two issues for the process of ETL'ing data into and out of this.