Open Boehmin opened 10 months ago
Thanks Ines, we'll try to fix the biopython thing in the next update. Can you please post your matplotlib error? Thx! Marco
Hi Marco,
the matplotlib error was
ModuleNotFoundError: No module named 'matplotlib'
I just installed matplotlib and it worked fine.
I have now managed to run through the whole notebook. I can post these issues separately or change the title of this issue if preferred, but here they are summed up for now:
I had the same pandas error as here. I freshly pulled from git yesterday. First I tried to change .append to pd.concat in probedesign.py, which obviously did not work. Forcing pandas==1.1.5 fixed it, as suggested in the linked issue.
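For reference, a minimal sketch of how a row-wise DataFrame.append(dict, ignore_index=True) loop is usually rewritten for newer pandas, where DataFrame.append no longer exists (it was removed in pandas 2.0). A plain text swap of .append for pd.concat fails because pd.concat expects a list of DataFrames, not a dict. The gene names and values below are placeholders, not taken from probedesign.py.

```python
import pandas as pd

# Collect one dict per gene, then build the frame once at the end.
rows = []
for n, gene in enumerate(["GeneA", "GeneB"]):  # placeholder gene names
    rows.append({"gene": gene, "Lbar_ID": str(201 + n)})
gene_names_ID = pd.DataFrame(rows)

# Adding a single row later: wrap the dict in a one-row frame, because
# pd.concat takes a list of DataFrame/Series objects, not a bare dict.
extra = pd.DataFrame([{"gene": "GeneC", "Lbar_ID": "203"}])
gene_names_ID = pd.concat([gene_names_ID, extra], ignore_index=True)

print(gene_names_ID)
```

Building the frame from a list of dicts also avoids copying the whole frame on every loop iteration.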
Since I still had to input the cutadapt file path, it took me a while to realise that it was loading the probedesign.py from the .egg module and changing the probedesign.py file would not do much. I had not worked with modules/.egg files before so it was quicker for me to do the following:
swap the following line
from PLP_directRNA_design import probedesign as plp
for
import sys
# Add the folder path to sys.path
sys.path.insert(0, '/user/pathto/PLP_directRNA_design')
import probedesign as plp
Not sure if there is an easier way (or if this might have broken things?).
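One way to check which copy of a module Python would pick up, before and after changing sys.path, is to ask importlib for the module's origin. This is only a sketch: module_origin is a hypothetical helper name, and it is demonstrated on the standard-library json module so that it runs anywhere; for this repository one would pass "probedesign" (or the package name) instead.

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# Demonstrated on a standard-library module; swap in the module you care about.
print(module_origin("json"))
print(module_origin("surely_not_an_installed_module_xyz"))  # None
```

If the printed path points into an .egg under site-packages, edits to the copy in the git checkout are not the code being run.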
Another issue I had was in the "Assign Genes to Barcode" section as I tested how="start" on=LbarID. Since it wasn't entirely clear to me whether to use the LbarID or a different variable/value, I tried all variations of "LbarID"/LbarID, numbers, column IDs, even barcode ID etc (maybe a full example could help?), until I figured out the issue was in the Lprobe_Ver2.csv. For clarity, this is the error I received:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc()
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'number'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[21], line 2
1 customizedlib=r"C:\Users\sergio.salas\Documents\PhD\projects\gene_design\hower_example_5\assigned_gene_LID.csv"
----> 2 probes=plp.build_plps(path,specific_seqs_final,L_probe_library,plp_length,how='start',on="Lbar_ID")
File ~/02-spatial_transcriptomics/01-dataanalysis/08-ISS_processing/PLP_directRNA_design/PLP_directRNA_design/probedesign.py:510, in build_plps(path, specific_seqs_final, L_probe_library, plp_length, how, on)
508 n=0
509 for g in gname:
--> 510 gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['number']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['number']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['number']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)
511 n=n+1
512 gene_names_ID2=gene_names_ID.set_index("gene", drop = False)
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3365 if is_scalar(key) and isna(key) and not self.hasnans:
3366 raise KeyError(key)
KeyError: 'number'
I changed the values of the Lbar_ID rows "LbarID_0" in the .csv to a number (201>) and changed sbh['number'] to sbh['Lbar_ID'], as shown below:
for g in gname:
    gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)
Now it runs.
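A KeyError like the one above can be caught earlier by checking the L-probe library for the columns that build_plps reads before calling it. The sketch below is an assumption-laden illustration: the column names are the ones visible in line 510 of the traceback, missing_columns is a hypothetical helper, and the one-row frame stands in for the real CSV.

```python
import pandas as pd

# Columns that the traceback shows build_plps looking up in the library.
REQUIRED_COLUMNS = ["number", "ID_Seq", "Lbar_ID", "L_Affy_ID"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Stand-in for a library file that lacks the 'number' column.
library = pd.DataFrame(
    {"Lbar_ID": ["LbarID_0"], "ID_Seq": ["ACGT"], "L_Affy_ID": ["A1"]}
)
print(missing_columns(library))  # ['number']
```

With a real file, the frame would come from pd.read_csv on the library path instead of being built by hand.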
Additionally, I was wondering if you ever had issues with duplicate sequences? I tried this pipeline with a random gene (Zfp85) and got 6 PLP sequences, 2 out of those are duplicates. Here the results attached: good_targetsfinal.csv designed_PLPs_final.csv
Thank you! Ines
Hi Ines, interesting, the pandas thing should have been fixed by a previous update. I'll double check. Same goes for the redundant probes: we had solved this in a previous version of our code, but the problem somehow made it back in here. Give me a few days to go through the code. M.
Ok here I am.
This is for both @Boehmin and @Sverreg (commenting on the issue mentioned here).
You get duplicate sequences (or sequences within a ±20 nt range, which overlap and should be excluded; you can check this in the position column), likely because your gene is a bad substrate for the probes. Either it's too short or it doesn't comply very well with our GC requirement.
When the search doesn't find a number of targets equal to or greater than the one you specified (default=5), it automatically returns all the targets.
Here's the relevant code snippet from the select_sequences function in PLP_design.py:
if ele.shape[0]<number_of_selected:
    selec=ele
else:
    for num in range(0,number_of_selected):
        if ele.shape[0]>0:
            randomlist = random.sample(range(0, ele.shape[0]), 1)
            sele=ele.iloc[randomlist,:]
            try:
                seleall=pd.concat([seleall,sele])
            except:
                seleall=sele
            exclude=list(range(int(sele['Position']-20),int(sele['Position']+20)))
            ele=ele.loc[~ele['Position'].isin(exclude),:]
    selec=seleall
selected2=pd.concat([selected2,selec])
I'd suggest running the search again for these genes, relaxing the GC content a bit or removing the requirement for a terminal G/C. Maybe that will fix the issue. Keep in mind that sometimes it's impossible to design "good" probes against some genes. You can try your chances anyway, design them manually, and they might work... Please let me know if my explanation doesn't make sense.
Cheers, Marco
Hi Marco,
thank you for the explanation! I'll keep that in mind and will give this a go. Just to check, would lowering the target requirement to 3 or 4 potentially also help?
Cheers, Ines
Hi Ines, regarding your last question: I think it's wise to have 5 probes (targets) per gene if possible. This will ensure high detection efficiency. While setting the target requirement to 3 or 4 will help solve the issue above, I'd still try to design 5 and do a manual check to remove duplicates and overlaps. Cheers and sorry for the late reply! Marco
I also just fixed the L-probe example file in a way that should now work consistently, and changed the pandas requirements again (turns out I had changed that in a test branch and not merged the changes to the main branch). Please keep finding bugs! Marco
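The manual check for duplicates and overlaps could be sketched roughly as below. This is not the repository's code: drop_overlapping is a hypothetical helper, it keeps targets greedily in row order rather than sampling at random, and it assumes only that the table has a Position column. The exclusion window mirrors the range(Position-20, Position+20) used in the snippet above.

```python
import pandas as pd

def drop_overlapping(targets, window=20):
    """Keep rows in order, skipping any whose Position falls inside
    [kept - window, kept + window) of an already kept row."""
    kept_rows = []       # positional row numbers to keep
    kept_positions = []  # Position values of the kept rows
    for row, pos in enumerate(targets["Position"]):
        if any(p - window <= pos < p + window for p in kept_positions):
            continue  # duplicate or overlapping target
        kept_rows.append(row)
        kept_positions.append(pos)
    return targets.iloc[kept_rows]

# Made-up positions: one exact duplicate (100) and one overlap (115).
targets = pd.DataFrame({"Position": [100, 100, 115, 130, 300]})
filtered = drop_overlapping(targets)
print(list(filtered["Position"]))  # [100, 130, 300]
```

Running this on the returned targets when fewer than the requested number were found would remove the duplicates, at the cost of ending up with fewer probes.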
I haven't changed the Biopython requirements, as you really get a warning rather than an error. So I'll try to fix it with a bit more time in a later release :)
Hi Marco,
thanks for the tip. I'll test this again. Maybe you can answer another question. I am trying to design probes that are as species-agnostic as possible between mouse and human (following the description in the pre-print). Do I understand correctly that I should:
I'm also slowly going through the ISS analysis notebooks, so I will start a separate issue should I run into bugs there. :)
Hi @mgcizzu ,
one more question, is the anchor sequence in the notebook the primer sequence? Since you use a pseudo-anchor as described in the supplementary, I assume the anchor in this notebook is the complementary sequence of your RCA primer? Thank you!
I just freshly set up the environment and the version of biopython installed was 1.83 where GC has changed to gc_fraction (since 1.80) - I received the below error and had to force install biopython 1.79 which helped resolve the issue. It did however tell me matplotlib was not available (which might also be unrelated?).
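As an alternative to pinning biopython 1.79, a small compatibility shim can cover both names. This is a sketch under assumptions: gc_percent is a hypothetical wrapper, and it relies on GC returning a percentage (0 to 100) while gc_fraction returns a fraction (0 to 1). The last fallback is plain Python, added only so the example runs without Biopython installed; it ignores ambiguity codes other than S.

```python
try:
    # Biopython 1.80 and later
    from Bio.SeqUtils import gc_fraction

    def gc_percent(seq):
        """GC content as a percentage, via gc_fraction."""
        return 100.0 * gc_fraction(seq)
except ImportError:
    try:
        # Older Biopython, where GC already returns a percentage
        from Bio.SeqUtils import GC

        def gc_percent(seq):
            """GC content as a percentage, via the older GC function."""
            return GC(seq)
    except ImportError:
        # Biopython not installed: simple fallback for illustration only
        def gc_percent(seq):
            """GC content as a percentage, counting G, C and S."""
            seq = str(seq).upper()
            if not seq:
                return 0.0
            return 100.0 * sum(seq.count(b) for b in "GCS") / len(seq)

print(gc_percent("ATGC"))  # 50.0
```

Code that previously compared GC(seq) against percentage thresholds can call gc_percent(seq) without changing those thresholds.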