bdekosky / PNAS_2015-25510


null and duplicate sequences #1

Open Pezhvuk opened 6 years ago

Pezhvuk commented 6 years ago

Dear @bdekosky ,

I have a few questions.

Firstly, could you please explain what the splits in the FASTA files are?

Secondly, are the provided FASTA files meant to be the final, pre-processed paired datasets?

Thirdly, I found some duplicated sequence headers, where one copy of the header has no sequence while the other does.
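A check along the following lines could surface such records and produce a concrete example to share. This is only a minimal sketch, assuming plain-text FASTA input where each record starts with a `>` header line; the function names `read_fasta` and `find_header_problems` and the toy records are hypothetical and not part of the repository.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_header_problems(records):
    """Return (duplicated, empty).

    duplicated: header -> list of sequences, for headers seen more than once.
    empty: sorted headers that have at least one record with no sequence.
    """
    seqs_by_header = defaultdict(list)
    for header, seq in records:
        seqs_by_header[header].append(seq)
    duplicated = {h: s for h, s in seqs_by_header.items() if len(s) > 1}
    empty = sorted(h for h, s in seqs_by_header.items() if "" in s)
    return duplicated, empty


# Toy input (made up for illustration): read_2 appears twice, once empty.
example = """>read_1
CAGGTGCAGCTG
>read_2
>read_2
GACATCCAGATG
""".splitlines()

duplicated, empty = find_header_problems(read_fasta(example))
print(duplicated)  # {'read_2': ['', 'GACATCCAGATG']}
print(empty)       # ['read_2']
```

To run it on a real file, pass an open file handle to `read_fasta` in place of `example`.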

Additionally, there are lots of identical sequences across different splits (mostly light chains), which I suspect is just due to the underlying biology rather than any error. Is that right?

Thank you, Pej.

bdekosky commented 6 years ago

Dear Pej,

No problem at all, and thanks again for your interest.

-Exactly which FASTA files are you referring to? We have many FASTA files related to this project. Can you provide an example of the splits you mean?

-I am not sure how the duplicated sequence headers happened, but if you could show an example, that would be very helpful.

-The identical sequences across different samples, particularly for the light chain, are due to the underlying biology. We discuss this phenomenon in doi:10.1038/nm.3743 (especially Figure 3), where we refer to "public" VL genes that are repeated across individuals. In many cases, heavy chain sequences are repeated across multiple biological replicates derived from the same human blood sample, which could result from /in vivo/ or /in vitro/ clonal expansion.
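To see how widespread the repeated sequences are, one could tabulate which samples each exact sequence appears in. The sketch below is only illustrative: it assumes records have already been parsed into (header, sequence) pairs per sample, the name `shared_sequences` and the sample labels are hypothetical, and the toy sequences are made up. It tests exact string identity only, so it cannot by itself distinguish public sequences from processing errors.

```python
from collections import defaultdict


def shared_sequences(records_by_sample):
    """Map each sequence found in two or more samples to those samples.

    records_by_sample: dict of sample name -> iterable of (header, sequence).
    Empty sequences are ignored.
    """
    samples_by_seq = defaultdict(set)
    for sample, records in records_by_sample.items():
        for _header, seq in records:
            if seq:
                samples_by_seq[seq].add(sample)
    return {
        seq: sorted(samples)
        for seq, samples in samples_by_seq.items()
        if len(samples) > 1
    }


# Toy data (made up): one sequence occurs in both samples.
records_by_sample = {
    "sample_A": [("a1", "GACATCCAGATG"), ("a2", "CAGGTGCAGCTG")],
    "sample_B": [("b1", "GACATCCAGATG"), ("b2", "GAGGTGCAGCTG")],
}

shared = shared_sequences(records_by_sample)
print(shared)  # {'GACATCCAGATG': ['sample_A', 'sample_B']}
```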

I hope this helps. Let me know if there is anything I can do to assist!

Best, Brandon

On 7/3/2018 10:53 AM, Pezhvuk wrote: (original message, quoted in full above)

bdekosky commented 6 years ago

We also discuss the existence, prevalence, and genetics of public antibody gene sequences extensively in the PNAS paper. Just search for the term "public" and you should find the relevant discussion.
