SasView / sasmeta

BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Port the Biological Scattering Tools website to smallangle.org #3

Open smk78 opened 3 years ago

smk78 commented 3 years ago

This is not strictly a SasView issue, but is concerned with a related analysis tool.

In late 2019 a round of emails between @smk78 , @timsnow , @butlerpd and @ajj agreed that CanSAS would take over the hosting/maintenance of the Biological Scattering Tools website currently hosted at psldc.isis.rl.ac.uk . This site is most notable for its BSLDC tool which gets a lot of use. But the developer behind the site and calculator left employment with ISIS/Diamond several years earlier.

The intention was to re-establish the site at http://psldc.smallangle.org, but that link currently points to the DANSE SANS project, the default for addresses on smallangle.org that do not have a specific DNS entry.

@ajj confirms that we are free to use the address and appropriate virtualhost setup.

This task therefore probably needs someone to edit the Apache configs. @ajj may need to add the DNS entry as he owns smallangle.org (unless he has permitted more admins?).

smk78 commented 3 years ago

@smk78 has requested a zip/tar bundle of the requisite files from the current hostmaster at ISIS.

pkienzle commented 3 years ago

The periodictable package already does most of the calculation so it may be available within the sasview sld calculator.

The results I'm getting from the online activation calculator are somewhat different from those for BSLDC. Can someone with bio knowledge cross-check the code and update periodictable to produce the correct answer?

Compare the β-casein protein. The number of H (1370+371 vs 1643.5) and O (309 vs 305) in the formula are different though C and N are the same. The estimated density also differs (1.27 g/cm³ vs 1.35 g/mL).

Activation calculator

Scattering from aa:RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV Source neutrons: 1.000 Å = 81.80 meV = 3956 m/s Source X-rays: 1.542 Å = 8.042 keV Sample in beam: C1080H1370H[1]317N268O309S6 at 1.27 g/cm³

1/e penetration depth (cm) Scattering length density (10-6/Ų) Scattering cross section (1/cm) X-ray SLD (10-6/Ų)
abs 51.526 real 1.681 coh 0.033 real 11.554
abs+incoh 0.208 imag -0.000 abs 0.019 imag -0.027
abs+incoh+coh 0.207 incoh 20.329 incoh 4.786

Neutron transmission is 0.819% for 1 cm of sample (after absorption and incoherent scattering). Transmitted flux is 8.186e+5 n/cm²/s for a 1e8 n/cm²/s beam. Contrast match point: 38.1% D2O by volume (real SLD = 2.088×10-6/Ų)

BSLDC tool

Table 1. The Total number of residues, Chemical composition, Molecular weight in kilodaltons, scattering length, scattering length density (ρ), molecular volume and number of exchangeable hydrogens for the inputed protein given it's sequence, % deuteration, % D2O concentration and % exchange.

The sequence submitted was called "Not entered"

The sequence submitted was "RELEELNVPGEIVESLSSSEESITRINKKIEKFQSEEQQQTEDELQDKIHPFAQTQSLVYPFPGPIPNSLPQNIPPLTQTPVVVPPFLQPEVMGVSKVKEAMAPKHKEMPFPKYPVEPFTESQSLTLTDVENLHLPLPLLQSWMHQPHQPLPPTVMFPPQSVLSLSQSKVLPVPQKAVPYPQRDMPIQAFLLYQEPVLGPVRGPFPIIV "

The percentage of D2O in the solution is "100"

The percentage of deuteration is "0"

The percentage of exchange is "90"

The concentration of the biomolecule in mg/ml is "0"

Total number of residues is 209.

('Chemical composition C1080', 'N', 268, 'O', 305, 'S', 6, 'H', 1643.5, 'P', 0)

('The molecular weight of the biomolecule is 23.826', 'kDa')

The Scattering length is 812.63 x10-4Å

The scattering length density (ρm) of the molecule is 2.768 x10-6Å-2

The scattering length denisty (ρs) of the solvent is 6.376 x10-6Å-2

The molecular volume is 29353.0 Å3

The number of exchangable hydrogens at pH 7.0 is 311.5

The density of the biomolecule is 1.35 g/ml

The estimate of intensity at zero angle I(0) of the biomolecule is 0 cm-1

smk78 commented 3 years ago

Here are the files that make the current website. psldc.zip

pkienzle commented 3 years ago

I've updated the sld calculator to python 3 and recent scipy (see attached). I did a little bit of code formatting in the process, but left the generated html more or less intact. It could use some cleanup as well.

For periodictable/fasta.py I'm using values from DOI:10.1016/S0167-7306(08)60575-X. I haven't yet cross-checked against the values in this calculator.

bsldc_cgi.py.txt

pkienzle commented 5 months ago

Doing some checking of the BSLDC code, I see some surprises and don't want to blindly copy the details.

Overall the calculations done by periodictable and BSLDC are reasonable, it's just that they are using different molecules with different number of labile hydrogen.

BSLDC uses Perkins(1986). See manual for full reference.

periodictable (pt) uses Perkins(1988). DOI:10.1016/S0167-7306(08)60575-X](https://doi.org/10.1016/S0167-7306(08)60575-X)

periodictable missed adding an extra H + OH for each sequence, which is need to terminate the dangling connectors at the ends of the sequence. The BSLDC code adds these to the molecular weight, but doesn't account for their scattering lengths.

Most of the BSLDC calculation is embedded in the magic numbers for mw (molecular weight) and sl (scattering length) which are precomputed for each residue. I have code below to print the equivalent values for the four cases (deuterated/non-deuterated in H2O/D2O).

For RNA and DNA the masses and scattering lengths differ by a sodium atom. These appear in Perkins(1988). No idea what Perkins(1986) is using for its formula. For the amino acids the number of hydrogen and the number of labile hydrogen differ quite a bit between the two sources. I do not know which one to use.

Tryptophan in BSLDC is inconsistent. The molecular weights for D2O and H2O differ by one but it claims two labile hydrogen. The scattering lengths differ by 20.8 for the undeuterated case (as expected for 2 labile H) but 10.6 for the deuterated case (should be about 10.4 going from -3.7409 for 1-H to 6.674 for 2-H; I'm ignoring the .015% natural abundance of D in natural H). I'm guessing that this was a programming error, where the scattering length for deuterated T in H2O was computed with the wrong number of labile H.

For several residues in BSLDC the number added to hydro does not match the number added to H. For example Threonine. This probably doesn't matter because "hydro" isn't used, and the C, H, O, N totals are only used for reporting, not calculation, but it will lead to inconsistencies between reported values in periodictable and BSLDC. Verifying that this is the source of the casein discrepancy will be tedious.

periodictable used different values for residue volumes (Perkins 1988 rather than Perkins 1986).

I don't understand the BSLDC volume fraction calculation. I expect it should be independent of exchange percent since you weigh the sample after deuteration, but before adding to the solvent. I suspect it is effectively equivalent to mass fraction since we are using it to compute overall transmission and scattering. That way we don't have to guess how much the sample will swell with the addition of solvent.

# Code for precomputing H/D mass and scattering lengths
from periodictable import fasta, H, D, Na
def sum_b(f):
  total = 0
  for el, count in f.atoms.items():
    total += el.neutron.b_c * count
  return total

def gen_bsldc(codes):
    for code, mol in codes.items():
      f = mol.labile_formula
      HH = f.replace(H[1], H)
      HD = f.replace(H[1], D)
      DH = f.replace(H, D).replace(H[1], H)
      DD = f.replace(H, D).replace(H[1], D)
      print(f"=== {mol.name} vol={mol.cell_volume:.2f} ex={f.atoms.get(H[1],0)} ===")
      print(f"{code} mw HH:{HH.mass:6.1f} HD:{HD.mass:6.1f} DH:{DH.mass:6.1f} DD:{DD.mass:6.1f}")
      print(f"  sl HH:{sum_b(HH):6.1f} HD:{sum_b(HD):6.1f} DH:{sum_b(DH):6.1f} DD:{sum_b(DD):6.1f}")
#print(fasta.AMINO_ACID_CODES)
print(f"Na mass: {Na.mass:.1f}, b_c: {Na.neutron.b_c}")
print("!!! RNA")
gen_bsldc(fasta.RNA_BASES)
print("!!! DNA")
gen_bsldc(fasta.DNA_BASES)
print("!!! Amino acids")
gen_bsldc(fasta.AMINO_ACID_CODES)
pkienzle commented 5 months ago

More questions:

  1. When looking at the SLD of DNA the current code is only computing SLD for single strand. Should we have the option of double-stranded SLD, with the complement added at each sequence position?
  2. Should we be including the H and OH terminators in the structure? Or are they better considered to be part of the solvent, with a slightly smaller protein volume and slightly larger contrast in the unterminated structure.
smk78 commented 5 months ago

Initial comments by email from Luke Clifton (originator of the Protein SLD Calculator):

I’ll have to look into these as Dan [Daniel Myatt] not myself wrote the eventual code on the web [the present BSLDC website] as part of his job and he changed the way we did things from how I did it in my original code.

Re comment: "For RNA and DNA the masses and scattering lengths differ by a sodium atom."

I’ll need to check this as this is likely just an adding discrepancy as these two things will be calculated independently.

I’ll check Trp to see if there is a problem.

Volume fractions of residues come from crystal structure averages and are independent of labelling as they assume these are the same.

Re comment: "For the amino acids the number of hydrogen and the number of labile hydrogen differ quite a bit between the two sources. I do not know which one to use."

These should be different. One is total hydrogen the other is exchangeable. Pretty sure this was correct in the original code.

Would it be worth arranging a chat.

pkienzle commented 5 months ago

There are inconsistencies other than Trp.

For each of the residues in bsldc I compute the difference in molecular weight and the difference in scattering length divided by the number of labile hydrogens for the undeuterated (u) and the fully deuterated (d) cases.

I expect the mw ratios to be 1 and the sl ratios to be ≈10.4, but I get something quite different (see below).

I notice that sl u≈10.4 always. I think the underlying values come from the tables in the Perkins (1986).

Is there a reason that the values in the remaining columns are all over the place?

This does not address differences in the 1986 and 1988 chemical formulas for the residues, which sometimes differ in the total number of hydrogens and in the number of hydrogens which are labile.

    DNA:A ex=2.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.55
    DNA:T ex=1.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.70
    DNA:G ex=3.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.50
    DNA:C ex=2.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.50
    RNA:A ex=3.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.47
    RNA:U ex=2.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.50
    RNA:G ex=4.0   mw  u=1.00  d=1.00     sl  u=10.43  d=10.45
    RNA:C ex=3.0   mw  u=1.00  d=1.00     sl  u=10.40  d=10.47
Protein:A ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.48
Protein:C ex=2.0   mw  u=0.50  d=1.00     sl  u=10.42  d=8.58
Protein:D ex=1.0   mw  u=2.00  d=4.00     sl  u=10.41  d=14.22
Protein:E ex=1.0   mw  u=2.00  d=2.00     sl  u=10.41  d=15.13
Protein:F ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.49
Protein:G ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.45
Protein:H ex=1.5   mw  u=1.33  d=1.33     sl  u=10.41  d=11.77
Protein:I ex=1.0   mw  u=2.00  d=1.00     sl  u=10.41  d=10.49
Protein:K ex=4.0   mw  u=0.75  d=0.75     sl  u=10.42  d=9.50
Protein:L ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.49
Protein:M ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.53
Protein:N ex=3.0   mw  u=1.00  d=1.00     sl  u=10.17  d=10.65
Protein:Q ex=3.0   mw  u=0.33  d=1.00     sl  u=10.41  d=10.44
Protein:R ex=6.0   mw  u=0.83  d=0.83     sl  u=10.41  d=28.33
Protein:S ex=2.0   mw  u=1.00  d=1.00     sl  u=10.41  d=13.34
Protein:T ex=2.0   mw  u=1.00  d=2.00     sl  u=10.41  d=10.44
Protein:V ex=1.0   mw  u=1.00  d=1.00     sl  u=10.41  d=10.25
Protein:W ex=2.0   mw  u=0.50  d=0.50     sl  u=10.42  d=5.28
Protein:Y ex=2.0   mw  u=1.00  d=1.00     sl  u=10.41  d=3.06

Note: I see that sl u=10.17 for Protein:N, but that's because there is a transcription error in the BSLDC code (65.08 rather than 65.80 for sl_D2O)

pkienzle commented 5 months ago

ORSO calculator at https://slddb.esss.dk/slddb/bio_blender produces yet another result for casein (see here)

Compound protein Description protein - 209 residues Chemical Formula C₁₀₈₀H₁₃₇₀Hx₃₁₁.₅N₂₆₈O₃₁₀S₆ Density (g/cm³) 1.332683 Neutron nuclear SLD (10⁻⁶ Å⁻²) 1.77540 -0.00010i Cu-Kɑ e-density (rₑ/ų) 0.431076 -0.000934i Mo-Kɑ SLD e-density (rₑ/ų) 0.429898 -0.000177i Custom (8040 eV) e-density (rₑ/ų) 0.431076 -0.000934i


Their table of amino acid chemical formulas and unit volumes is here

pkienzle commented 4 months ago

Comments from a bio user re: differences in H between bsldc/ORSO and periodictable for Histidine and Cysteine:

When an amino acid is incorporated into the polypeptide chain, which occurs via a dehydration reaction, you lose an oxygen and 2 labile hydrogens. For periodictable.fasta, we want to consider these already incorporated into the polypeptide chain.

For example, for histidine, (wikipedia), the formula for the amino acid is C6H9N3O2 (in its uncharged form). Upon incorporation into the polypeptide chain, its formula is C6H7N3O. Two of these hydrogens (the ones attached to nitrogens) are labile. At neutral pH, approximately half the histidines are charged and half are not, so this is where the extra 0.5H could come in.

Histidine should then have 7.5H, with 2.5H being labile. So not the same as either of your sources [H6.5 with 1.5 labile in bsldc; H8 with 3 labile in periodictable], but closer to the second one [periodictable] (which apparently assumes the charged form of histidine [periodictable has Asp-, Glu-, Lys+, His+, and Arg+]).

Cysteine should have 5H with 2H being labile [as in bsldc]. (Formula C3H7NO2S for the bare amino acid, remove H2O upon incorporation, leaving C3H5NOS; the hydrogens attached to the S and the N are labile.) However, cysteine frequently forms disulfide bonds, in which case one of the labile H would be lost, leaving you with 4 total and 1 labile [as in periodictable]. I do not know what fraction of cysteines in nature remain unbonded. You could very reasonably split the difference (4.5H, 1.5 labile), or require people to specify if it is bonded (e.g. C for cysteine and Cb for bonded cysteine).

Clearly this is a matter for the larger neutron bio community to address. I'm leaving periodictable as is until a consensus is reached.

pkienzle commented 4 months ago

Regarding dna/rna volumes, the values in bsldc/ORSO are traceable to [1], using the rna volumes of ACG for dna, with a correction of 18.3 mL/mol to account for + PO₃ – H₂O in chain relative to the isolated neucleoside. I'm switching to these volumes for periodictable, though using the measured values for the deoxygenated forms of ACG for dna.

[1] V.A. Buckin, B.I. Kankiya, R.L. Kazaryan, Hydration of nucleosides in dilute aqueous solutions: Ultrasonic velocity and density measurements, Biophysical Chemistry, Volume 34, Issue 3, 1989, Pages 211-223, ISSN 0301-4622, https://doi.org/10.1016/0301-4622(89)80060-2.