StanfordBioinformatics / pulsar_lims

A LIMS for ENCODE submitting labs.

ENCODE data submission: manual submission for Will Greenleaf lab #481

Open twang15 opened 3 years ago

twang15 commented 3 years ago

Hi Tao,

We have a submission for Will Greenleaf’s lab that I have been working on for a while now. It’s a bit more complicated, and I was wondering if we could have a call this week to go through the spreadsheet together so I can explain everything to you. I can’t meet on Thursday during our usual time because I will be at the German embassy. I could talk tomorrow after our DCC meeting or on Wednesday morning. Let me know what works best for you.

Here’s a link to the spreadsheet in the meantime. I will explain the tweaks when we meet. https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1812279104

Thanks! Annika

twang15 commented 3 years ago

Hi Annika,

We can have a discussion tomorrow morning after meeting with DCC.

Here is the github issue for this submission: https://github.com/StanfordBioinformatics/pulsar_lims/issues/481

Best, Tao

twang15 commented 3 years ago

Download Reference file to scg

twang15 commented 3 years ago

Hi Annika,

Could you help me prepare the submission sheet for the Greenleaf lab? In the tab “Reference_Tao”, please fill in the last column (marked red).

https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1329654463

Thanks, Tao

twang15 commented 3 years ago

Reference file uploaded

twang15 commented 3 years ago

Hi Annika,

https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1292157095

For FunctionalCharacterizationExperiment, assay_term_name should be one of the following:

    "CRISPR screen",
    "MPRA",
    "perturbation followed by scRNA-seq",
    "perturbation followed by snATAC-seq",
    "pooled clone sequencing",
    "STARR-seq"

Which one is the right one?

Thanks, Tao
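
A minimal sketch of checking a spreadsheet value against the list above before posting. The set is copied from this thread and may not match the live portal schema; the function name `check_assay_term_name` is hypothetical.

```python
# Allowed values as quoted in this thread; the portal schema is the authority.
ALLOWED_ASSAY_TERM_NAMES = {
    "CRISPR screen",
    "MPRA",
    "perturbation followed by scRNA-seq",
    "perturbation followed by snATAC-seq",
    "pooled clone sequencing",
    "STARR-seq",
}


def check_assay_term_name(value: str) -> str:
    """Return the value unchanged if it is in the list, else raise ValueError."""
    if value not in ALLOWED_ASSAY_TERM_NAMES:
        allowed = ", ".join(sorted(ALLOWED_ASSAY_TERM_NAMES))
        raise ValueError(f"{value!r} is not one of: {allowed}")
    return value
```

Matching is exact and case-sensitive here, so a value such as "Starr-seq" would be rejected locally rather than by the portal.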

twang15 commented 3 years ago

functional_characterization_experiment is the right profile for a functional characterization experiment. Many others are available.

twang15 commented 3 years ago

perturbation followed by snATAC-seq

Thanks, Annika

twang15 commented 3 years ago

Hi SCG administrators,

I (taowang9) need read permission on /oak/stanford/groups/wjg/jgranja/GEO_SpearATAC .

Could you please help me out?

Best, Tao

twang15 commented 3 years ago

Hi Annika,

https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1812279104

I remember that only a subset of all the files need to go to the portal. Could you please let me know the details?

Thanks, Tao

twang15 commented 3 years ago

Tao,

That directory has open permissions. I've added execute bit for you on /oak/stanford/groups/wjg/jgranja so that you can get to it. Try it now.

Best,

Taymoor Arif

twang15 commented 3 years ago

Hi Tao, I marked everything red that should be uploaded in your file tab. It’s all the files with file_format “fastq” or “bam”.

Let me know if you have questions, Annika
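
The selection rule above (upload only rows whose file_format is “fastq” or “bam”) could be applied to the file tab roughly as follows. This is a sketch only: it assumes each spreadsheet row has been loaded as a dict with a `file_format` key, and `select_files_for_upload` is a hypothetical helper name.

```python
# File formats that should go to the portal, per the message above.
UPLOAD_FORMATS = {"fastq", "bam"}


def select_files_for_upload(rows):
    """Keep only the rows whose file_format is fastq or bam."""
    return [row for row in rows if row.get("file_format") in UPLOAD_FORMATS]
```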

twang15 commented 3 years ago

Hi Annika,

https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1467081546

Platform is required for file submission. But the platform “Illumina NextSeq 550” does not exist on the portal.

Could you please help register it?

Thanks, Tao

twang15 commented 3 years ago

Hi Annika,

The platform is now registered at: https://www.encodeproject.org/platforms/NTR:0000655/

It also has the alias encode:NextSeq550.

Best, Jennifer

twang15 commented 3 years ago

Hi Jennifer,

Thanks for the registration! However, I cannot access it at this time. Please grant me access.

Error message: Not available

Your account is not allowed to view this page.

Best, Tao

twang15 commented 3 years ago

Hi Sarah,

Could you answer the following questions for us?

Since we are going to model the processing, could we get some details expanding on the following? “To see an example analysis for these fastq files see https://github.com/GreenleafLab/SpearATAC_MS_2021/tree/main/AlignSgRNA. You can find example fastqs and the pipeline used to process these fastqs into a table corresponding to each sgRNA.” Essentially, we need to know the tools and versions used to process the data.

Thanks, Tao

twang15 commented 2 years ago

Hi Annika and Tao,

I was reviewing the SPEAR data and came up with the following FASTQs that failed validation:

https://www.encodeproject.org/report/?type=File&submitted_by.@id=%2Fusers%2F9e077d38-a99b-4f84-8c79-6c75cf505731%2F&status=content+error&field=%40id&field=aliases&field=content_error_detail

I see that some of the errors are due to suspected duplication between files. For example, the file ENCFF306PQP (will-greenleaf:scATAC_GM_LargeScreen_Rep4_R1_001.fastq.gz) seems to be “conflicting” with ENCFF661VYE, ENCFF925HWL, ENCFF606KQM, ENCFF178OBX, …

What I find confusing is that some of the files belong to other experiments (for example, ENCFF306PQP is from ENCSR820VEF (will-greenleaf:sp_experiment8), but ENCFF661VYE is from ENCSR138NWM (will-greenleaf:sp_experiment7)). Perhaps it is all a demultiplexing issue, but I want to check before I let these files through.

Thanks,

Idan

twang15 commented 2 years ago

Hi Idan,

I double-checked the original submission sheet and did not find any problem on my side, unless the original experiment grouping was wrong.

@Amy, could you help elaborate on this issue?

Thanks, Tao

twang15 commented 2 years ago

Meeting memo: Annika, Tao, Idan, Jennifer

  1. Annika to confirm with Sarah whether fragments and csv files need to be combined into one tar file for submission as part of existing scATAC experiments
  2. Idan to send detailed guidelines to Tao on how to submit sgRNA experiments and files

twang15 commented 2 years ago

Found an email from August with the answer in regard to sgRNAs. Tao, please feel free to contact me if the answer below does not make sense.

Idan

twang15 commented 2 years ago

Hi Sarah,

I’ve prepared the submission sheet for the “pooled clone sequencing” FunctionalCharacterizationExperiment sgRNA.

Could you please provide more information about the libraries “Library-sgRNA” in this spreadsheet? https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1322748522

Thanks, Tao

twang15 commented 2 years ago

Hi Tao,

The following 6 "elements reference" files are flagged with a "content error": https://www.encodeproject.org/report/?type=File&submitted_by.%40id=%2Fusers%2F9e077d38-a99b-4f84-8c79-6c75cf505731%2F&field=%40id&field=output_type&field=aliases&field=content_error_detail&sort=output_type&lab.title=Will+Greenleaf%2C+Stanford&status=content+error&status=uploading&output_type=elements+reference

This is because the portal expects .fasta files to be gzipped (.fasta.gz). Would you be able to gzip the 6 files and then reupload?

Please let me know when you are ready to do so, and I can reset the status of the files for that.

Separately, the DCC was able to resolve the validation errors on the index reads files after adding read structure metadata. For the reads with potential duplication errors, we are awaiting further info.

Thanks! Jennifer
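
A minimal sketch of the gzip step requested above, using only the Python standard library. The helper name `gzip_fasta` is hypothetical; it writes a `.gz` copy next to each file and leaves the original in place.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fasta(path: Path) -> Path:
    """Compress path to path + '.gz' and return the new path."""
    gz_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream the bytes so large reference files are not loaded into memory.
        shutil.copyfileobj(src, dst)
    return gz_path
```

The compressed files would then be re-uploaded once the file status on the portal has been reset.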

twang15 commented 2 years ago

Hi Jennifer,

These files are ready for resubmission. Please reset the status for me.

Best, Tao

twang15 commented 2 years ago

Hi Tao,

Thanks! I have changed the status to uploading, please feel free to resubmit the files now.

Jennifer

twang15 commented 2 years ago

Hi Jennifer,

I tried but failed with the following error. Could you please check the credential status again?

    2021-11-02 14:22:20,383:eu_debug: Attempting to generate new file upload credentials
    2021-11-02 14:22:20,534:eu_debug: Error 403: unable to re-issue upload credentials for 'ENCFF617HLE'
    2021-11-02 14:22:20,537:eu_debug: {
        "@type": [ "HTTPForbidden", "Error" ],
        "code": 403,
        "description": "Access was denied to this resource.",
        "detail": "status must be \"uploading\" to issue new credentials",
        "status": "error",
        "title": "Forbidden"
    }

Thanks, Tao

twang15 commented 2 years ago

This resubmission is done.

twang15 commented 2 years ago

Hi Annika and Idan,

Here is what has been finished since our last meeting, and I want to update the current status of this submission:

  1. Idan to send detailed guidelines to Tao on how to submit sgRNA experiments and files. [Done]
  2. Tao and Jennifer to patch the 6 "elements reference" files. [Done]

The current status:

  3. Tao is waiting for Sarah’s input on the library-sgRNA submission: https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1322748522 [In progress]
  4. Annika to confirm with Sarah whether fragments and csv files need to be combined into one tar file for submission as part of existing scATAC experiments. [In progress]

I do have some concerns about the progress since Sarah is not that reachable these days.

Best, Tao

twang15 commented 2 years ago

Sure, Annika. I think I will be able to move forward once Sarah finishes the library-sgRNA sheet. There may be other issues popping up later, but I will let everyone know if any.

Best, Tao

twang15 commented 2 years ago

From Sarah:

Okay sorry I need a little more clarification, so I am only filling out the "Library-sgRNA" sheet in that Excel file? And should there be a row for every sgRNA sample that was submitted (so there should be 19 total)?

For the construction platform, it was a homemade protocol rather than a kit -- how should I indicate that?

And then how do I link the information between the "Library-sgRNA" sheet and the "Files-sgRNA" sheet so that you know which library coordinates with which Files?

twang15 commented 2 years ago

Hi Sarah and Idan,

Idan, could you comment on the construction platform field?

Sarah, for the other two questions,

my understanding is that for each biosample, there is one library. But I am not sure whether a sgRNA library shared the same biosample as a spear-ATAC. Do they share the biosample in your experiment? If so, we may use the same registered biosample ENC on the portal.

For the link between a library and files, it is established through another sheet: “Replicate-sgRNA” (https://docs.google.com/spreadsheets/d/1kxFkyQg19nLxj6kdhBS_pN8dzLp9xrNh/edit#gid=1253730928), you can refer to sheet “Replicate-Tao” as a starting point.

Please note that the submission is a multi-step process; we will need to revisit the other sheets once the experiments and libraries have been submitted.

Best, Tao

twang15 commented 2 years ago

Hi Sarah, Will,

The problem with the files from your experiments (and this is why they are flagged) is that the headers/seq identifies in the files are not unique, also between files that belong to different experiments. The DCC validator picked this up for several files in your experiments. This happens sometimes when samples were multiplexed and unique sample information is not retained when demultiplexed. Someone from the Greenleaf team has to check that the files were properly demultiplexed and that the information given to me/Tao reflects the correct file names.

Please let us know if there are questions,

Annika

twang15 commented 2 years ago

Hi Sarah and Will,

There is another problem that I noticed. DCC requires the index read of spearATAC experiments to have “read_structure” information.

You may need to schedule a meeting with Ingrid and Jennifer to discuss the details for this field and let me know the result. This information depends on how the experiment was done and we cannot solve it without your inputs.

Best, Tao

twang15 commented 2 years ago

Tao,

{"sequence_element": "barcode", "start": 1, "end": 16}

Thanks, Annika
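
To illustrate what the read_structure entry above describes, here is a small sketch that slices the barcode out of an index read. It assumes `start` and `end` are 1-based and inclusive, which this thread does not state; the read sequence is made up and `extract_element` is a hypothetical helper.

```python
def extract_element(read_sequence: str, element: dict) -> str:
    """Slice one read_structure element out of a read.

    Assumes 1-based, inclusive coordinates (an assumption, not confirmed here).
    """
    start, end = element["start"], element["end"]
    if start < 1 or end < start or end > len(read_sequence):
        raise ValueError(f"coordinates {start}-{end} do not fit the read")
    return read_sequence[start - 1:end]


# The entry quoted in the message above.
read_structure = [{"sequence_element": "barcode", "start": 1, "end": 16}]
index_read = "ACGTACGTACGTACGTTTTT"  # made-up sequence for illustration
barcode = extract_element(index_read, read_structure[0])
```

Under that assumption the barcode is the first 16 bases of the index read.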

twang15 commented 2 years ago

From Sarah:

Hi Annika,

I don't think I understand what you mean by "the headers/seq identifies in the files are not unique, also between files that belong to different experiments." Do you mean the names of the files themselves? From what I see in "FunctionalCharacterizationExperiment_Tao" which lists all of the files, they all appear to have unique names. We know that the files were demultiplexed properly because we went forward and analyzed the data from there -- it would have been quite obvious if any of the reads were from the wrong experiment (diff cell lines, diff sgRNAs, etc.).

twang15 commented 2 years ago

Hi Sarah,

If library construction was done using a “homemade” protocol, please send the PDF to Tao or me; we will register it on the portal and can then associate it with the libraries in question.

In regard to your question about non-uniqueness:

If the DCC looks at two FASTQ files with read names like:

    FILE1 @INST1:234:LXH5YIU:3:12:235:11:34 1::0:N
    FILE2 @INST1:2122334:LXH5YIU:3:11:231:12:3 1::0:N

we will flag FILE1 and FILE2 as suspected duplicates, because the read names suggest that the flowcell and lane of the reads in these two files were the same (LXH5YIU:3). We understand that this type of “potential” duplication may result from validly demultiplexed files, but we do not have a simple or efficient way to validate correctness, so we rely on the labs to take a look and make sure there was no erroneous submission of duplicate FASTQs that were not supposed to have this apparent similarity in the read names.

I hope that explains Annika’s question.

Idan
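
The check described above can be sketched as grouping files by the flowcell and lane found in a read name. This is not the DCC validator, only an illustration of the idea; it assumes the flowcell and lane are the third and fourth colon-separated fields, as in the two example read names, and both function names are hypothetical.

```python
from collections import defaultdict


def flowcell_lane(read_name: str) -> tuple:
    """Return (flowcell, lane) taken from the 3rd and 4th ':' fields of a read name."""
    fields = read_name.lstrip("@").split()[0].split(":")
    if len(fields) < 4:
        raise ValueError(f"unexpected read name: {read_name!r}")
    return fields[2], fields[3]


def suspected_duplicates(first_reads: dict) -> dict:
    """Map each (flowcell, lane) shared by more than one file to those file names."""
    groups = defaultdict(list)
    for file_name, read_name in first_reads.items():
        groups[flowcell_lane(read_name)].append(file_name)
    return {key: names for key, names in groups.items() if len(names) > 1}


examples = {
    "FILE1": "@INST1:234:LXH5YIU:3:12:235:11:34 1::0:N",
    "FILE2": "@INST1:2122334:LXH5YIU:3:11:231:12:3 1::0:N",
}
flagged = suspected_duplicates(examples)
```

A shared flowcell and lane only marks files for a manual look; as noted above, correctly demultiplexed files can legitimately share both.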

twang15 commented 2 years ago

https://docs.google.com/document/d/18HWBgcL8nrYF90-JcYRWcwAOJOuQenXJm-6X_4HeK3s/edit#

TODO for Tao, 11/16/2021, DCC meeting

Submit the SpearATAC processed data in one .tar file

twang15 commented 2 years ago

From William Greenleaf:

Are we all set for these data uploads then? W

twang15 commented 2 years ago

I think we are waiting for Sarah’s response?

Idan

twang15 commented 2 years ago

Hi Idan,

I attached the protocol for making the sgRNA libraries.

Also yes, the files were demultiplexed appropriately.

Thanks!

Best, Sarah

twang15 commented 2 years ago

Thank you Sarah, we will upload the protocol and will patch the FASTQs that apparently have no duplication.

Tao, I am not sure where you stand in regard to the creation of pooled-clone-sequencing experiments?

Thanks,

Idan

twang15 commented 2 years ago

Hi Idan and everyone,

Thanks for pushing this forward.

I am working on it now, and will keep everyone posted.

Best, Tao

twang15 commented 2 years ago

Hi Idan,

I have one quick question:

We decided to combine scATAC_K562_Pilot_Rep1.fragments.tsv.gz and scATAC_K562_Pilot_Rep1.singlecell.csv into one tarball. Could you please let me know which “output_type” we should use for the tarball submission?

Thanks, Tao
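
A minimal sketch of bundling the two files into a single uncompressed .tar with the Python standard library. The helper name `bundle_files` and the choice to store only base names inside the archive are assumptions, not something decided in this thread.

```python
import tarfile
from pathlib import Path


def bundle_files(paths, tar_path) -> Path:
    """Write the given files into one uncompressed tar archive."""
    with tarfile.open(tar_path, "w") as tar:
        for path in paths:
            # Store only the file name so the archive has no directory prefix.
            tar.add(path, arcname=Path(path).name)
    return Path(tar_path)
```

For the pair above this would be called with scATAC_K562_Pilot_Rep1.fragments.tsv.gz and scATAC_K562_Pilot_Rep1.singlecell.csv as `paths`.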

twang15 commented 2 years ago

The last groups of paired files for scATAC were tarred into one file and have been submitted.

twang15 commented 2 years ago

Hi Idan,

The sgRNA experiment submission fails validation, but I can only see the following client-side error message.

Could you please take a look and share your thoughts?

Thanks, Tao

    2021-11-22 10:23:36,781:eu_debug: <<<<<< POST functional_characterization_experiment record will-greenleaf:Spear_ATAC-K562-Pilot_sgRNA To DCC with URL https://www.encodeproject.org/functional_characterization_experiment and this payload:

    {
        "aliases": [ "will-greenleaf:Spear_ATAC-K562-Pilot_sgRNA" ],
        "assay_term_name": "pooled clone sequencing",
        "award": "/awards/UM1HG009436/",
        "biosample_ontology": "/biosample-types/cell_line_EFO_0002067/",
        "elements_mappings": [ "ENCSR858ICE" ],
        "elements_references": [ "ENCSR867SMQ" ],
        "lab": "/labs/will-greenleaf/",
        "plasmids_library_type": "gRNA cloning"
    }

    2021-11-22 10:23:36,928:eu_debug: Failed to POST will-greenleaf:Spear_ATAC-K562-Pilot_sgRNA
    2021-11-22 10:23:36,929:eu_debug: <<<<<< DCC POST RESPONSE:
    2021-11-22 10:23:36,932:eu_debug: {
        "@type": [ "ValidationFailure", "Error" ],
        "code": 422,
        "description": "Failed validation",
        "errors": [
            {
                "description": "{'elements_mappings': ['ENCSR858ICE'], 'elements_references': ['ENCSR867SMQ'], 'lab': '/labs/will-greenleaf/', 'award': '/awards/UM1HG009436/', 'biosample_ontology': '/biosample-types/cell_line_EFO_0002067/', 'assay_term_name': 'pooled clone sequencing', 'aliases': ['will-greenleaf:Spear_ATAC-K562-Pilot_sgRNA'], 'plasmids_library_type': 'gRNA cloning'} is not valid under any of the given schemas",
                "location": "body",
                "name": []
            }
        ],
        "status": "error",
        "title": "Unprocessable Entity"
    }

twang15 commented 2 years ago

Hi Tao,

Please add

"control_type": "control",

and try again

Idan
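
For reference, the suggested change amounts to adding one key to the payload from the failed POST. The sketch below only builds and prints the patched record; it does not contact the portal, and the thread does not say which schema rule makes `control_type` necessary here. The helper name `with_control_type` is hypothetical.

```python
import json

# Payload copied from the failed POST logged earlier in this thread.
payload = {
    "aliases": ["will-greenleaf:Spear_ATAC-K562-Pilot_sgRNA"],
    "assay_term_name": "pooled clone sequencing",
    "award": "/awards/UM1HG009436/",
    "biosample_ontology": "/biosample-types/cell_line_EFO_0002067/",
    "elements_mappings": ["ENCSR858ICE"],
    "elements_references": ["ENCSR867SMQ"],
    "lab": "/labs/will-greenleaf/",
    "plasmids_library_type": "gRNA cloning",
}


def with_control_type(record: dict, control_type: str = "control") -> dict:
    """Return a copy of the record with control_type set, leaving the input untouched."""
    patched = dict(record)
    patched["control_type"] = control_type
    return patched


print(json.dumps(with_control_type(payload), indent=2, sort_keys=True))
```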

twang15 commented 2 years ago

Hi Idan,

It works. The next question is about the construction_platform for the sgRNA libraries. Do you know which platform we should use?

Thanks, Tao

twang15 commented 2 years ago

If you have a 10X kit/platform identifier, we can register a new Platform object on the portal to link to. If not, we can leave it unspecified.

Idan

twang15 commented 2 years ago

Hi Idan,

This is the 10X identifier that I used in Snyder Lab’s submission. https://www.encodeproject.org/platforms/NTR:0000452/

Can we use the same one for the current spearAtac submission?

Thanks, Tao

twang15 commented 2 years ago

When I read Sarah’s sgRNA library generation protocol, no 10X specific details were specified, so I am not sure. It is a question for Sarah (or someone that knows exactly how the experiment was performed).

Idan

twang15 commented 2 years ago

From Sarah:

No 10x specific protocols were used to generate the sgRNA libraries after the scATAC libraries were made. Please note that the sgRNA libraries are derivatives of the scATAC libraries -- e.g. the scATAC library is the starting material to create the sgRNA library, and therefore the sgRNA library is just a subset of the scATAC library. I hope that clarifies things.

twang15 commented 2 years ago

Thanks Sarah,

Tao, please submit these libraries without platform info.

Idan