biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.42k stars 1.76k forks source link

How to change feature caption to eithe "product" or "gene" when drawing plasmid diagram? #2838

Closed limin321 closed 4 years ago

limin321 commented 4 years ago

Setup

I am trying to draw a digram of a Ti plasmid genbank file using biopython, one of the code is to add features to empty feature set: The code is:

gd_feature_set.add_feature(feature, color=color, label=True, label_size=0.7)

The above code will produce a diagram with locus tag as feature caption. What I want is using "product" as feature caption, how should I modify the previous code. In the Biopython book, it said use name="my gene", this will cause all features to have the same caption as "my gene". I want each feature to use its "product " as caption.

My second question is, what if I want to use "gene" name as feature caption. In this case, because not all CDS will be assigned a gene name since annotation won't be perfect, so if I want to use "gene" name as feature caption. How should I modify the previous code again?

Your help are greatly appreciated.

Best, Limin

(Please copy and run the above in your Python, and copy-and-paste the output)

Expected behaviour

(Please fill this in)

Actual behaviour

(Please fill this in, and provide any exception message in full)

Steps to reproduce

(Please fill this in )

peterjc commented 4 years ago

Probably you need to do something like this:

label = feature.qualifiers["product"][0]
gd_feature_set.add_feature(feature, color=color, label=True, label_size=0.7, name=label)

Or, using "gene" where not all the feature have this, thus using the dictionary .get(...) and setting a default of ["Unknown"]:

label = feature.qualifiers.get("gene", ["Unknown"])[0]
gd_feature_set.add_feature(feature, color=color, label=True, label_size=0.7, name=label)
limin321 commented 4 years ago
label = feature.qualifiers["product"][0]

Hi Thank you so much for your suggestion. I still have issues: when I run this code

for feature in record.features:
    if feature.type != "gene":
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color = color, label = True)
gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4', start=0, end=len(record))
gd_diagram.write("CC001_pTi_gene.pdf", "PDF")

The output looks like this:

Screen Shot 2020-04-30 at 2 40 44 PM

which means, if there is gene name, the caption is gene name, otherwise, it use locus-tag.

Then I add label = feature.qualifiers["product"][0] I run into the following error message

Screen Shot 2020-04-30 at 2 43 13 PM

However, when I add label = feature.qualifiers.get("gene",["Unknown"])[0] This wont change the output at all, I got the same result as the first one.

Screen Shot 2020-04-30 at 2 46 02 PM

Really confusing now.

Could you please help me clarify what is going on here? Thanks a lot.

limin321 commented 4 years ago

Probably you need to do something like this:

label = feature.qualifiers["product"][0]
gd_feature_set.add_feature(feature, color=color, label=True, label_size=0.7, name=label)

Or, using "gene" where not all the feature have this, thus using the dictionary .get(...) and setting a default of ["Unknown"]:

label = feature.qualifiers.get("gene", ["Unknown"])[0]
gd_feature_set.add_feature(feature, color=color, label=True, label_size=0.7, name=label)

I actually add these code, caption = feature.qualifiers.get("product",["Unknown"])[0] gd_feature_set.add_feature(feature, color = color, label = True, name = caption)

and the output looks like this

Screen Shot 2020-04-30 at 4 29 47 PM

All captions are marked as unknown. Could you please also help me with this? Really appreciate all your help.

Best,

peterjc commented 4 years ago

Evidently your GenBank file has few or no features with the "product" annotation key. Which file are you loading, or can you share a snippet of the feature table?

limin321 commented 4 years ago

Evidently your GenBank file has few or no features with the "product" annotation key. Which file are you loading, or can you share a snippet of the feature table?

Sure.

Screen Shot 2020-05-01 at 10 48 19 AM

Here is how the annotation looks like.

peterjc commented 4 years ago

This is subtle - these CDS features do have "product" annotations, so "product" should be in the feature.qualifiers dictionary. The mistake is that you are not drawing the CDS features, but the gene features (which have very little annotation).

P.S. Copy and pasting text is usually more helpful than a screen shot - especially with error messages. On github putting a triple back-tick line above and below the quoted text looks best (i.e. the ` symbol).

limin321 commented 4 years ago

This is subtle - these CDS features do have "product" annotations, so "product" should be in the feature.qualifiers dictionary. The mistake is that you are not drawing the CDS features, but the gene features (which have very little annotation).

P.S. Copy and pasting text is usually more helpful than a screen shot - especially with error messages. On github putting a triple back-tick line above and below the quoted text looks best (i.e. the ` symbol).

Hi Thanks for the advice, you help me figure out the why gene is not availabe, I didn't know this before. Now I changed it to 'CDS', still couldn't make it work. No matter how I modify the label, the output doesn't change acccordingly.

name = "C58_pTi"
gd_diagram = GenomeDiagram.Diagram(name)
gd_track_for_features = gd_diagram.new_track(1, name = "Annotated Features")
gd_feature_set = gd_track_for_features.new_set()

for feature in record.features:
    if feature.type != "CDS":
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    #label = feature.qualifiers["product"][0] 
    #label = feature.qualifiers.get("product",["Unknown"])[0] 
    gd_feature_set.add_feature(feature, color = color, label = True, label_size=4, label_angle= -60, label_position="middle",
                               sigil="ARROW", arrowshaft_height=0.3, arrowhead_length=0.1)
#gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4', start=0, end=len(record))
gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm), start=0, end=len(record), circle_core=0.7)
gd_diagram.write("C58_pTi.pdf", "PDF")

Run the above code, by default, it use gene name as caption. If no gene name, then it use locus tag. Then I tried both following, none of them change the output.

label = feature.qualifiers["product"][0]  
label = feature.qualifiers.get("gene", ["Unknown"])[0]

Could you please help me take a look at my problem again. Really appreciate your help .

peterjc commented 4 years ago

In the code above, you never use the text value set in variable label when you call gd_feature_set.add_feature(...), you need gd_feature_set.add_feature(..., name=label) as before.

limin321 commented 4 years ago

In the code above, you never use the text value set in variable label when you call gd_feature_set.add_feature(...), you need gd_feature_set.add_feature(..., name=label) as before.

Thank you so much. I am pretty new to python, you taught me a lot.

label = feature.qualifiers["product"][0]   

This one works now. It is able to label all features with product name. I have another question: feature.qualifiers["product"] does this mean you are extracting the a key name called "product" in qulifiers list? why you add [0] in the end? I tried to remove [0] in the end, it will produce an error message: AttributeError: 'list' object has no attribute 'strip'

label = feature.qualifiers.get("gene", ["Unknown"])[0]

This one is able to label features with gene names if gene names are available. If gene name not available, they all all labeled as "Unknown". Is it possible to tell python to label the unknown ones with their product name. I am thinking of a if-else statements, because my poor knowledge, I don't know how to do that. Could you please help me again?

Thank you so so so much. Stay safe.

peterjc commented 4 years ago

We need a Python string as the caption, and Python strings have a .strip(...) method. If you give some other object (like a list), then it does not have a .strip(...) method and that triggers the AttributeError.

Where did the list come from? This is down to a quirk of the GenBank format, where the feature annotation qualifiers come in key/value pairs. At first glance a Python dictionary with the values as strings would be perfect. The trouble is sometimes you get a key repeated more than once, so you'd need to merge the strings, or use a list of strings in those cases. We decided for consistency that the qualifier values would all be lists of strings, even if usually the list had just one entry (entry zero string Python arrays count from zero).

i.e. feature.qualifiers["product"] was a list (containing just one string), while feature.qualifiers["product"][0] was that string.

If you think having going though this that we can clarify anything in the main tutorial or other documentation, specific feedback would be very valuable.

peterjc commented 4 years ago

And to your follow up question, try:

product = feature.qualifiers["product"][0] 
# use gene name if present, fall back on product as default
label = feature.qualifiers.get("gene", [product])[0] 
limin321 commented 4 years ago

We need a Python string as the caption, and Python strings have a .strip(...) method. If you give some other object (like a list), then it does not have a .strip(...) method and that triggers the AttributeError.

Where did the list come from? This is down to a quirk of the GenBank format, where the feature annotation qualifiers come in key/value pairs. At first glance a Python dictionary with the values as strings would be perfect. The trouble is sometimes you get a key repeated more than once, so you'd need to merge the strings, or use a list of strings in those cases. We decided for consistency that the qualifier values would all be lists of strings, even if usually the list had just one entry (entry zero string Python arrays count from zero).

i.e. feature.qualifiers["product"] was a list (containing just one string), while feature.qualifiers["product"][0] was that string.

If you think having going though this that we can clarify anything in the main tutorial or other documentation, specific feedback would be very valuable.

Hi

You made things so making sense to me. My brain never had felt so clear for a month. If the tutorial include all this details like the way you explain to me, that will be really awesome. This does help people like me tremendously.

product = feature.qualifiers["product"][0] 
# use gene name if present, fall back on product as default
label = feature.qualifiers.get("gene", product)[0] 
# I actually tried getting rid of the [ ], it runs with no error message, but the output is very interesting. why need a [ ] for product variable. Kind of hard for me understanding this part.  I thought "gene" is also a key name of the qualifiers dictionary, but you use qualifiers.get()? May I ask why this is different from the way you access to "product".

By the way, could you please recommend more resource with more details for drawing genomediagram, like the way you explained to me. That will be really helpful and practical.

Best and stay safe.

limin321 commented 4 years ago

And to your follow up question, try:

product = feature.qualifiers["product"][0] 
# use gene name if present, fall back on product as default
label = feature.qualifiers.get("gene", [product])[0] 

These codes work perfectly. Thank you so much. Really appreciate you taught me so much.

Best and be safe.

peterjc commented 4 years ago

I'm glad to have helped. Maybe something based on your aim of using different preferences for the label text would be good as another example in the tutorial.

Looking at the code again, I see that the .name_qualifiers attribute is how @widdowquinn may have intended this to be done. This defaults to ["gene", "label", "name", "locus_tag", "product"] but the initialisation does not explicitly let you change it. I'd need to explore how best to change it, but you'd want it to be ["gene", "product"] instead.

limin321 commented 4 years ago

I'm glad to have helped. Maybe something based on your aim of using different preferences for the label text would be good as another example in the tutorial.

Looking at the code again, I see that the .name_qualifiers attribute is how @widdowquinn may have intended this to be done. This defaults to ["gene", "label", "name", "locus_tag", "product"] but the initialisation does not explicitly let you change it. I'd need to explore how best to change it, but you'd want it to be ["gene", "product"] instead.

Hi, I just read your comments again. still confused that .name_qualifiers defaults to ["gene", "label", "name", "locus_tag", "product"]. What does this mean? are those key names of a qualifiers list? Because when I print singe feature, like

record = SeqIO.read("C58_pTi.gb", "genbank")
print(record.features)
f1 = record.features[3]
print(f1)

the output looks like this: type: CDS location: 386:1502 qualifiers: Key: codon_start, Value: ['1'] Key: gene, Value: ['torf6'] Key: locus_tag, Value: ['Atu6000'] Key: old_locus_tag, Value: ['AGR_pTi_39'] Key: product, Value: ['agrocinopine synthase'] Key: protein_id, Value: ['AAK90959.2'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MYNLGKGVQECTETSLLTALGGMERRNLSELDTTLTSDNLPVVSHDFNTWRISTLPDRLIRELHSADVKNIPVIIREVSNGEITESYTVTNDIIPFLEPLLDKVFDVNPGATFFLDGRDFEAHIITAWLSRRPKYRQRVVLLFYTFEYSSGEAFFDAVKQVSPEPDWRETVALMPVIFPQELPRLAKLIDLSDIAVEDLYEGGKIWIDSMLRQPMRVVAVHIVMARVKPEQLGLCDKPEIVRAFNADYAAVRLAYYVKDNPEVRKMRPHLKLATATRCYDFSAQLENGEKAEFSINIRTGRPMPRETDERKYIRRGYGIPGNAAAIADWVISDRSEDEMSFWEWRNKGVPRRVDHHCPHLDVIEQDGKSVP']

I am still learning how to access qualifiers keys and values. Thanks a lot.

peterjc commented 4 years ago

Yes, it is the default list of qualifier keys which GenomeDiagram will try when you let it work out the drawing caption on its own. In this case there is a "gene" entry so that is picked, the default caption would be "torf6".

limin321 commented 4 years ago

Yes, it is the default list of qualifier keys which GenomeDiagram will try when you let it work out the drawing caption on its own. In this case there is a "gene" entry so that is picked, the default caption would be "torf6".

Thank you so much. That is very helpful explaination. May I have another question about drawing diagram. I have some region from different bacterial, and trying to align them for comparing. is that possible to give the same gene/CDS from different strains the same color? I don't really see the exact codes in biopython book, the example there is manually given each CDS a color. In my case, some times I don't really know how many CDS was there, so I want to label the same CDS with the same color. Could you please teach me how to write that codes? Thanks a lot. Here is my codes for the alignment:

from Bio import SeqIO
from Bio.Graphics import GenomeDiagram
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.SeqFeature import SeqFeature, FeatureLocation

fname_list = ['Gle002', 'Tul003','CL001','Sta004', 'Sta005' ]

name = "pTi_DNA"
gd_diagram=GenomeDiagram.Diagram(name)
max_len=0
for f in fname_list:
    record = SeqIO.read(f+"_TD.gb", "genbank")
    max_len=max(max_len, len(record))
    gd_track_for_features=gd_diagram.new_track(1, name =f,greytrack=True, start=0, end=len(record))
    gd_feature_set=gd_track_for_features.new_set()
# the following for loop add features
    i=0
    for feature in record.features:
        if feature.type !="CDS":
            continue
        if len(gd_feature_set)%2==0:
            color=colors.blue
        else:
            color=colors.lightblue
        i += 1
        product = feature.qualifiers["product"][0]
        label = feature.qualifiers.get("gene", [product])[0] + str(i)
        gd_feature_set.add_feature(feature, sigil="ARROW", color=color, label=True, name=label, label_size=2,label_angle=0)

gd_diagram.draw(format="linear", pagesize='A4',fragments=1,start=0,end=max_len)
gd_diagram.write(name+'.pdf','pdf')

Here is the alignment:

Screen Shot 2020-05-12 at 8 04 27 PM

If the same CDS or gene could be colored the same, that will be really nice. Thank you so much.

Best and be safe.

peterjc commented 4 years ago

That new question is much more complicated - how will you define the "same" gene? Do they use the same gene name? Or might you look at reciprocal best blast hits or something? You might find these examples useful https://www.open-bio.org/2012/03/02/cross-links-in-genomediagram/ and https://armchairbiology.blogspot.com/2012/09/the-colours-man-colours.html - Good luck!

limin321 commented 4 years ago

That new question is much more complicated - how will you define the "same" gene? Do they use the same gene name? Or might you look at reciprocal best blast hits or something? You might find these examples useful https://www.open-bio.org/2012/03/02/cross-links-in-genomediagram/ and https://armchairbiology.blogspot.com/2012/09/the-colours-man-colours.html - Good luck!

Thank you so much. I will try the two links you shared. Best,