airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Additional NucleicAcidProcessing fields #87

Closed williamdlees closed 1 year ago

williamdlees commented 6 years ago

Given the otherwise comprehensive nature of this table, I suggest that it should explicitly contain the primer sets used for amplification. Having explicit fields for them would ensure that they are present and easy to find.

bussec commented 6 years ago

For MiAIRR we decided against describing the primers themselves, as it can be hard to predict where they would bind, especially for degenerate ones and primer mixes. Thus there are currently only the two fields forward_pcr_primer_target_location and reverse_pcr_primer_target_location, which indicate the "guaranteed non-templated" part of the sequence. Was there any application that we missed?

schristley commented 6 years ago

This would definitely be useful for performing re-analysis from the raw data. Right now it requires careful extraction from the supplementary information of the publication. I see two interests here: 1) the primer sequences used for biological processing and 2) the primer sequences used in software processing. From the few publications I've analyzed, the sequences described for 1) don't always seem to be the sequences used for 2). Take this paper, for example: in the supplemental material, one primer is listed as such:

CCATCTCATCCCTGCGTGTCTCCGACTCAG TAAAAGGTGTCCAGTGT

yet you cannot use all of this for software processing; only the latter part, TAAAAGGTGTCCAGTGT, is relevant. From a software processing perspective, it would be nice if such primers were provided as FASTA files.
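As a rough illustration of the FASTA suggestion, here is a minimal Python sketch that strips a leading, non-relevant portion from a primer as listed in a supplement and emits the remainder as a FASTA record. The function name `primers_to_fasta`, the record name `example_primer`, and the idea of passing the leading portion as a known `prefix` are all assumptions made for this example; nothing here is part of the AIRR schema or tooling.

```python
def primers_to_fasta(primers, prefix=""):
    """Return FASTA text for a dict of {name: sequence}.

    `prefix` is a hypothetical, user-supplied leading sequence that is
    not relevant for software processing and is removed when present.
    """
    records = []
    for name, seq in primers.items():
        # Normalise: drop spaces used for visual grouping, upper-case.
        seq = seq.replace(" ", "").upper()
        if prefix and seq.startswith(prefix):
            seq = seq[len(prefix):]
        records.append(f">{name}\n{seq}")
    return "\n".join(records) + "\n"


# The primer exactly as listed in the comment above; the split point
# between the two parts is taken from that comment.
PREFIX = "CCATCTCATCCCTGCGTGTCTCCGACTCAG"
listed = {"example_primer": "CCATCTCATCCCTGCGTGTCTCCGACTCAG TAAAAGGTGTCCAGTGT"}

fasta = primers_to_fasta(listed, prefix=PREFIX)
print(fasta, end="")
```

With the primer above, this prints a single record whose sequence is TAAAAGGTGTCCAGTGT. Keeping the full listed sequence alongside the trimmed one would preserve both the "biological" and the "software" view of the primer.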

williamdlees commented 6 years ago

The target locations are useful, but is it not possible that we will need to store multiple locations where a mix of primers is used?

Appreciate that target locations are much easier to work with from a processing point of view. Is it worth storing the primer sequences themselves, in the interests of reproducibility and verification? Or is that not a concern?

bussec commented 6 years ago

In terms of experimental reproducibility, the primer sequence is only one piece of information. Therefore it needs to be reported with other conditions (e.g. master mix, thermocycler settings). The advantage of primer sequences is that people usually have them at hand, while hardly anyone knows the exact positions.

williamdlees commented 6 years ago

Agreed that the primers are only one piece of information required for reproducibility, but they are necessary for processing raw sequences, and, as Scott has mentioned, frequently hard to find in publications and in some cases can't be copied and pasted. It's relatively common for researchers to refer to previous publications for primer sets, sometimes multiple publications, which raises the possibility of misunderstandings and even errors creeping in. I think their explicit inclusion would be helpful. And the point about the sequences being to hand is an important one, I think. There is effort and potential error in deriving target locations.

Re. positions, in my current project, which I don't think is untypical, we are using a constant region primer for IgG which is at a different location to the other heavy chain primers in order to distinguish subtypes. This doesn't map cleanly onto a single target location for the repertoire, and to use the 'safe' locations for all sequences would mask the valuable subtype information.

How is the target position specified, given in particular that it may be outside the v-region? Apologies if I have missed this in the documentation.

bussec commented 5 years ago

No, the documentation does not currently specify this. As the (forward|reverse)_pcr_primer_target_location keywords are flagged as AIRR_CUSTOM regarding SRA submissions, there is no NCBI standard that we would have to comply with.

My naive assumption is that we are using classical "biological" numbering, meaning that "1" denotes:

In this scheme, positions in the 5' UTR would be denoted by negative values, starting at "-1" for the base directly 5' of the ATG (so there is no 0). We can discuss whether there is any actual use for this, as this would mainly apply to 5'-RACE protocols, which do not use sequence-specific primers.
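To make the "no 0" convention concrete, here is a small Python sketch converting between a 0-based offset and the numbering described above. It assumes that "1" is the A of the ATG, which is inferred from "-1" being the base directly 5' of the ATG and is not confirmed by the documentation. The function names are hypothetical.

```python
def offset_to_position(offset):
    """Convert a 0-based offset (0 = A of the ATG, -1 = base directly
    5' of it) to "biological" numbering, which skips 0."""
    # Offsets at or past the A shift up by one; 5' UTR offsets are unchanged.
    return offset + 1 if offset >= 0 else offset


def position_to_offset(position):
    """Inverse of offset_to_position; position 0 is invalid by definition."""
    if position == 0:
        raise ValueError("position 0 does not exist in this numbering")
    return position - 1 if position > 0 else position


# Offsets running from two bases 5' of the A to two bases 3' of it.
positions = [offset_to_position(o) for o in range(-2, 3)]
print(positions)  # [-2, -1, 1, 2, 3]
```

One consequence worth noting is that distances cannot be computed by simple subtraction across the -1/1 boundary; converting to offsets first avoids an off-by-one error.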

scharch commented 1 year ago

closed as stale - MiAIRR now in maintenance mode